Re: Resumable clone/Gittorrent (again) - stable packs?

On Fri, 7 Jan 2011, Zenaan Harkness wrote:

> On Fri, Jan 7, 2011 at 08:09, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
> > On Thu, 6 Jan 2011, Zenaan Harkness wrote:
> >
> >> Bittorrent requires some stability around torrent files.
> >>
> >> Can packs be generated deterministically?
> >
> > They _could_, but we do _not_ want to do that.
> >
> > The only thing which is stable in Git is the canonical representation of
> > objects, and the objects they depend on, expressed by their SHA1
> > signature.  Any BitTorrent-alike design for Git must be based on that
> > property and not the packed representation of those objects which is not
> > meant to be stable.
> >
> > If you don't want to design anything and simply reuse current BitTorrent
> > codebase then simply create a Git bundle from some release version and
> > seed that bundle for a sufficiently long period to be worth it.  Then
> > falling back to git fetch in order to bring the repo up to date with the
> > very latest commits should be small and quick.  When that clone gets too
> > big then it's time to start seeding another more up-to-date bundle.
> 
> Thanks guys for the explanations.
> 
> So, we don't _want_ to generate packs deterministically.
> BUT, we _can_ reliably unpack a pack (duh).

Of course.

> So if my configured "canonical upstream" decides on a particular
> compression etc., I (my git client) don't care what has been chosen
> by my upstream.

Indeed.  This is like saying: I'm sending you the value 52, but I chose 
to use the representation "24 + 28", while someone else might decide to 
encode that value as "13 * 4" instead.  You are still able to decode it 
to the same result in both cases.
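
To see that property concretely, here's a rough sketch (the repack 
flags and the window/depth values are arbitrary choices, not a 
recommendation): repacking rewrites the pack bytes, yet every object 
still verifies against the same SHA1.

	cd my_project
	ls .git/objects/pack/      # note the current pack file name
	# force a full repack with different delta search parameters
	git repack -a -d -f --window=50 --depth=50
	ls .git/objects/pack/      # likely a differently-named pack now
	git fsck --full            # yet all objects hash exactly as before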

> What is important for torrent-able packs though is stability over some
> time period, no matter what the format.

Hence my suggestion to simply seed a Git bundle over BitTorrent.  Bundles 
are files designed to be moved around by completely ad hoc transports, 
and you can fetch from them just as if they were a remote repository.
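
For instance, assuming a bundle file named my_project.gitbundle:

	git bundle verify my_project.gitbundle   # check the bundle is usable
	git fetch my_project.gitbundle master:refs/remotes/bundle/master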

> There's been much talk of caching, invalidating of caches, overlapping
> torrent-packs etc.

And in my humble opinion this is all just crap.  All those suggestions 
are fragile, create administrative issues, eat up server resources, and 
they are all suboptimal in the end.  No one has implemented a working 
prototype so far either.

We don't want caches.  Fundamentally, we do not need any cache.  Caches 
are a pain to administer on a busy server anyway, as they eat disk 
space, and they also represent a much bigger security risk than a 
read-only operation.

Furthermore, a cache is good only for the common case that everyone 
wants.  But with Git, you cannot presume that everyone is at the same 
version locally.  So either you do a custom transfer for each client to 
minimize the data sent, in which case caching the result might not 
benefit many people, or you make the cached data bigger to cover more 
cases while making each transfer suboptimal.

Finally, we do have a cache already, and that's the existing packs 
themselves.  During a clone, the vast majority of the transferred data 
is streamed straight out of those existing packs without further 
processing, as we try to reuse as much data as possible from those 
packs so as not to recompute/recompress that data all the time.
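
That cache is plainly visible on disk; for example (standard repository 
layout):

	ls .git/objects/pack/    # the existing packs that double as the cache
	git verify-pack -v .git/objects/pack/pack-*.idx | tail  # pack stats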

> In every case, for torrents to work, the P2P'd files must have some
> stability over some time period.
> (If this assumption is incorrect, please clarify, not counting
> every-file-is-a-torrent and every-commit-is-a-torrent.)
> 
> So, torrentable options:
> - torrent per commit
> - torrent per pack
> - torrent per torrent-archive - new file format
> 
> Torrent per commit - too small, too many torrents; we need larger
> p2p-able sizes in general.
> 
> Torrent per pack - packs non-deterministically created, both between
> hosts and even intra-host (libz upgrade, nr_threads change, git pack
> algorithm optimization).
> 
> A new torrent format, if "close enough" to current git pack
> performance (cpu load, threadability, size), could become a new
> version of the git pack file format - we don't want to store two sets
> of pack files on disk if we can sensibly avoid it. Unlikely to happen,
> though - I can't conceive of a torrentable format that would be
> anything but worse than pack files, so it would be rejected from git
> master.
> 
> Can we relax the perceived requirement to deterministically create
> pack files?
> Well, over what time period are pack files stable in a particular git
> repository?
> Over what time period do we require stable files for torrenting?
> 
> Can we simply configure our local git to keep specified pack files for
> specified time period?
> And use those for torrent-packs?
> Perhaps the torrent file could have a UseBy date?

Again, this is just too much complexity for so little gain.

Here's what I suggest:

	cd my_project
	# name the bundle after the current Unix timestamp so that
	# successive bundles don't collide
	BUNDLENAME=my_project_$(date "+%s").gitbundle
	# snapshot all refs (branches and tags) into a single bundle file
	git bundle create $BUNDLENAME --all
	# build the .torrent metadata file pointing at your tracker
	maketorrent-console your_favorite_tracker $BUNDLENAME

Then start seeding that bundle, and upload $BUNDLENAME.torrent as 
bundle.torrent inside my_project.git on your server.

Now... Git clients could be improved to first check for the availability 
of the file "bundle.torrent" on the remote side, either directly in 
my_project.git or through some Git protocol extension.  Or even better, 
the torrent hash could be stored in a Git ref, such as 
refs/bittorrent/bundle, and the client could use that to retrieve the 
bundle.torrent file through some other means.
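
One concrete way to do a variant of that (the ref name comes from 
above; the paths and URL are assumptions, and here the .torrent file 
itself is stored as a blob rather than the raw torrent hash):

	# server side: save the .torrent as a blob and point a ref at it
	cd my_project.git
	git update-ref refs/bittorrent/bundle \
		$(git hash-object -w /path/to/$BUNDLENAME.torrent)

	# client side: fetch just that ref, then extract the file
	git fetch git://example.com/my_project.git refs/bittorrent/bundle
	git cat-file blob FETCH_HEAD > bundle.torrent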

When the bundle.torrent file is retrieved, just pull the torrent 
content (and seed it some more to be nice).  Then simply run "git clone" 
using the original arguments but with the obtained bundle instead of the 
original URL.  Then replace the bundle file path in .git/config with the 
actual remote URL.  And finally perform a "git pull" to bring in the new 
commits that were added to the remote repository since the bundle was 
created.  That final pull will be small and quick.
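
Spelled out as shell commands (the bundle name, remote URL, and torrent 
client are all hypothetical):

	# download the bundle with any BitTorrent client
	btdownloadheadless bundle.torrent
	# clone from the bundle as if it were the remote repository
	git clone my_project.gitbundle my_project
	cd my_project
	# point origin at the real repository instead of the bundle file
	git config remote.origin.url git://example.com/my_project.git
	# catch up on everything committed since the bundle was created
	git pull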

After a while, that final pull will get bigger as the difference between 
the bundled version and the current tip of the remote repository grows.  
So every so often, say every 3 months, it might be a good idea to 
create a new bundle that includes the latest commits, in order to make 
that final pull small and quick again.
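
If you want to automate that, something like this hypothetical crontab 
entry would do (paths made up; note that '%' must be escaped as '\%' 
inside a crontab):

	# rebuild the bundle at 03:00 on the 1st of every third month
	0 3 1 */3 *  cd /srv/git/my_project && git bundle create /srv/bundles/my_project_$(date "+\%s").gitbundle --all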

Isn't that sufficient?


Nicolas
