Git clonebundles

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi all,

Bitbucket recently added support for Mercurial’s clonebundle extension
(http://gregoryszorc.com/blog/2015/10/22/cloning-improvements-in-mercurial-3.6/).
Mercurial’s clone bundles allow the Mercurial client to seed a repository using
a bundle file instead of dynamically generating a bundle for the client.


Mercurial clonebundles?
~~~~~~~~~~~~~~~~~~~~~~~

With Mercurial clonebundles the high level clone sequence looks like this:

1. The command "hg clone URL"  attempts to clone the repository at URL.
2. If a bundle file exists for the repository, the existence of the file
`clonebundles.manifest` causes the server to advertise the `clonebundle`
capability (capabilities lookup is the first command the client issues).
3. In the above case the client then executes the command "clonebundles".
4. The manifest file will be returned.
5. The client then selects a bundle file to download from the list of URLs
advertised in the manifests file, to seed the repository.
6. To update the repository the last step involves fetching the latest changes.


Why is this useful?
~~~~~~~~~~~~~~~~~~~

The fact that clone bundles can be distributed as static files enables us to
use static file servers for bundle distribution. Users have also reported
latency improvements for clone operations of popular Mercurial repositories.
Additionally this significantly reduces the resource usage of clone operations,
as clone operations are reduced to simpler fetches to resolve the delta between
the current repository and the downloaded bundle state.


clonebundles for git?
~~~~~~~~~~~~~~~~~~~~~

We recently looked into how this concept could be translated to git. This is
not a new idea and has been discussed before (more on that later) but our
success with the Mercurial clonebundle rollout prompted us to revisit this
topic.

We believe that bringing a similar concept to git could have the following
benefits:

* Improved clone times for users that clone large git repositories, especially
  if bundle file distribution leverages global CDNs.
* Improved scalability of git for managing large popular repositories.
  Offloading a significant portion of the clone resource usage to CDNs or static
  file hosts.


Our current proof-of-concept to explore this space, closely follows
the approach from Mercurial outlined above.

* An `/info/bundle` path returns a bundle manifest (over HTTP)
* The bundle manifest contains a simple list of URLs with some additional meta
  data that allows the client to select a suitable bundle download URL
* The bundle download URL points to a bundle file generated using `git bundle
  create` including all the relevant refs as a self contained repository seed.
* The client probes the target URL with a `GET` request to $URL/info/bundle and
  downloads the bundle file if present.
* The repository will be created based on the downloaded bundle (downloading a
  static file allows resumable downloads or parallel downloads of chunks if the
  file/web server supports range requests).
* A `git fetch` and the appropriate checkout then updates the "cloned"
  repository to match the latest upstream state.

The proof-of-concept was built as an external binary `git-clone2` that
mimics the behaviour of the `git clone` command, so unfortunately I
can't provide any patches to git to demonstrate the behaviour.


Ultimately our proof-of-concept is built around a few core ideas:

* Re-use the existing bundle format as a single-file, self-contained
repository representation.
* Introduce a bundle manifest (accessible at `$URL/info/bundle`) that allows
  the client to resolve a suitable bundle download URL.
* Teach the `git clone` command to accept and prefer seeding a repository using
  a static bundle file that is advertised in a bundle manifest.
* Re-use as much as possible of the existing commands and in particular the
  `git bundle` machinery to seed the repository and to create the static bundle
  file.
* We accept additional storage requirements for the bundle files in addition to
  the actual repository content in pack-files or loose objects.
Hosting providers
  or system administrators are free to decide how many bundles to advertise and
  how frequently the bundles are updated.
* It targets the "seed from a bundle file" use case, with resumable clones just
  being a potential side-effect.


Some of the problems that need to be solved with an approach like this are:

* Bundle advertisement/bundle negotiation: We considered advertising a
  new capability "clonebundle" as part of the rev advertisement
capabilities list.
  This would allow clients that support clonebundles to abort the clone attempt
  and resolve a suitable bundle URL from a bundle manifest at `$URL/info/bundle`
  instead. For HTTP this would amount to an early termination when
retrieving the
  ref-advertisement.
  Note: We didn't pursue this for our proof-of-concept so we didn't
explore whether
  this is feasible.
* Uniform approach for the supported transports: Our proof-of-concept
only supports HTTP as
  a transport. Ideally the clonebundle capability could be supported by all
  available transports (of which at least ssh would be highly desirable).
* Bundle manifest and bundle download: It is unclear whose responsibility it is
  to generate the bundle manifest with the bundle download URLs. Most likely the
  bundle files will be served using a webserver or CDN, so download
URL generation
  should not be a core git responsibility. For hosting purpose we envision that
  the bundle manifest might contain dynamic download URLs with personalised
  access tokens with expiry.
* Bundle generation: Similar to the above it is unclear how bundle
generation is handled.
  For hosting purposes, the operator would likely want to influence
when and how bundles are generated.



Prior art
~~~~~~~~~

Our proof-of-concept is built on top of ideas that have been
circulating for a while. We are aware of a number of proposed changes
in this space:


* Jeff King's work on network bundles:
https://github.com/peff/git/commit/17e2409df37edd0c49ef7d35f47a7695f9608900
* Nguyễn Thái Ngọc Duy's work on "[PATCH 0/8] Resumable clone
revisited, proof of concept":
https://www.spinics.net/lists/git/msg267260.html
* Resumable clone work by Kevin Wern:
https://public-inbox.org/git/1473984742-12516-1-git-send-email-kevin.m.wern@xxxxxxxxx/


Whilst the above mentioned proposals/proposed changes are in a similar
space, I would be interest to understand whether there is any
consensus on the general idea of supporting static bundle files as a
mechanism to seed a repository?
I would also appreciate any pointers to other discussions in this area.


Best regards,
Stefan Saasen & Erik van Zijst; Atlassian Bitbucket




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]