Hi all,

Bitbucket recently added support for Mercurial’s clonebundle extension
(http://gregoryszorc.com/blog/2015/10/22/cloning-improvements-in-mercurial-3.6/).
Mercurial’s clone bundles allow the Mercurial client to seed a
repository from a pre-generated bundle file instead of having the server
dynamically generate a bundle for each client.

Mercurial clonebundles?
~~~~~~~~~~~~~~~~~~~~~~~

With Mercurial clonebundles the high level clone sequence looks like
this:

1. The command "hg clone URL" attempts to clone the repository at URL.
2. If a bundle file exists for the repository, the existence of the file
   `clonebundles.manifest` causes the server to advertise the
   `clonebundle` capability (capabilities lookup is the first command
   the client issues).
3. In the above case the client then executes the command
   "clonebundles".
4. The server returns the manifest file.
5. The client selects a bundle file to download from the list of URLs
   advertised in the manifest file and uses it to seed the repository.
6. Finally, the client fetches the latest changes from the server to
   bring the repository up to date.

Why is this useful?
~~~~~~~~~~~~~~~~~~~

The fact that clone bundles can be distributed as static files enables
us to use static file servers for bundle distribution. Users have also
reported latency improvements for clone operations of popular Mercurial
repositories. Additionally this significantly reduces the server-side
resource usage of clone operations, as a clone is reduced to a simpler
fetch that resolves the delta between the current repository and the
downloaded bundle state.

clonebundles for git?
~~~~~~~~~~~~~~~~~~~~~

We recently looked into how this concept could be translated to git.
This is not a new idea and has been discussed before (more on that
later) but our success with the Mercurial clonebundle rollout prompted
us to revisit this topic.

We believe that bringing a similar concept to git could have the
following benefits:

* Improved clone times for users that clone large git repositories,
  especially if bundle file distribution leverages global CDNs.
* Improved scalability of git for managing large popular repositories,
  by offloading a significant portion of the clone resource usage to
  CDNs or static file hosts.

Our current proof-of-concept to explore this space closely follows the
approach from Mercurial outlined above:

* An `/info/bundle` path returns a bundle manifest (over HTTP).
* The bundle manifest contains a simple list of URLs with some
  additional metadata that allows the client to select a suitable bundle
  download URL.
* The bundle download URL points to a bundle file generated using
  `git bundle create`, including all the relevant refs, as a
  self-contained repository seed.
* The client probes the target URL with a `GET` request to
  `$URL/info/bundle` and downloads the bundle file if present.
* The repository will be created based on the downloaded bundle
  (downloading a static file allows resumable downloads or parallel
  downloads of chunks if the file/web server supports range requests).
* A `git fetch` and the appropriate checkout then update the "cloned"
  repository to match the latest upstream state.

The proof-of-concept was built as an external binary `git-clone2` that
mimics the behaviour of the `git clone` command, so unfortunately I
can't provide any patches to git to demonstrate the behaviour.

Ultimately our proof-of-concept is built around a few core ideas:

* Re-use the existing bundle format as a single-file, self-contained
  repository representation.
* Introduce a bundle manifest (accessible at `$URL/info/bundle`) that
  allows the client to resolve a suitable bundle download URL.
* Teach the `git clone` command to accept and prefer seeding a
  repository using a static bundle file that is advertised in a bundle
  manifest.
* Re-use as much as possible of the existing commands, and in particular
  the `git bundle` machinery, to seed the repository and to create the
  static bundle file.
* We accept additional storage requirements for the bundle files in
  addition to the actual repository content in pack-files or loose
  objects. Hosting providers or system administrators are free to decide
  how many bundles to advertise and how frequently the bundles are
  updated.
* It targets the "seed from a bundle file" use case, with resumable
  clones just being a potential side-effect.

Some of the problems that need to be solved with an approach like this
are:

* Bundle advertisement/bundle negotiation: We considered advertising a
  new capability "clonebundle" as part of the ref advertisement
  capabilities list. This would allow clients that support clonebundles
  to abort the clone attempt and resolve a suitable bundle URL from a
  bundle manifest at `$URL/info/bundle` instead. For HTTP this would
  amount to an early termination when retrieving the ref advertisement.
  Note: We didn't pursue this for our proof-of-concept, so we didn't
  explore whether this is feasible.
* Uniform approach for the supported transports: Our proof-of-concept
  only supports HTTP as a transport. Ideally the clonebundle capability
  could be supported by all available transports (of which at least ssh
  would be highly desirable).
* Bundle manifest and bundle download: It is unclear whose
  responsibility it is to generate the bundle manifest with the bundle
  download URLs. Most likely the bundle files will be served using a
  webserver or CDN, so download URL generation should not be a core git
  responsibility. For hosting purposes we envision that the bundle
  manifest might contain dynamic download URLs with personalised access
  tokens that expire.
* Bundle generation: Similar to the above, it is unclear how bundle
  generation should be handled. For hosting purposes, the operator would
  likely want to influence when and how bundles are generated.
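
To make the client-side flow described above a little more concrete,
here is a minimal Python sketch. It is not the `git-clone2`
implementation. The manifest format (one URL per line followed by
optional key=value attributes, loosely modelled on Mercurial's
manifest), the `region` attribute used in the usage note, and all
function names are assumptions made purely for illustration. Only the
`$URL/info/bundle` path and the "seed from bundle, then fetch" sequence
come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BundleEntry:
    """One manifest line: a download URL plus optional metadata."""
    url: str
    attrs: Dict[str, str] = field(default_factory=dict)


def manifest_url(repo_url: str) -> str:
    """Location the client probes with a GET request."""
    return repo_url.rstrip("/") + "/info/bundle"


def parse_manifest(text: str) -> List[BundleEntry]:
    """Parse the ASSUMED format: 'URL key=value key=value ...' per line."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        attrs = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        entries.append(BundleEntry(parts[0], attrs))
    return entries


def select_bundle(entries: List[BundleEntry],
                  wanted: Optional[Dict[str, str]] = None) -> Optional[BundleEntry]:
    """Return the first entry whose attributes match all wanted values."""
    wanted = wanted or {}
    for entry in entries:
        if all(entry.attrs.get(k) == v for k, v in wanted.items()):
            return entry
    return None  # no suitable bundle: fall back to a regular clone


def clone_commands(repo_url: str, bundle_path: str, dest: str) -> List[List[str]]:
    """Git commands that seed `dest` from a downloaded bundle, then catch up.

    The final checkout/fast-forward depends on the branch in use and is
    deliberately left out of this sketch.
    """
    return [
        ["git", "clone", bundle_path, dest],                           # seed from bundle
        ["git", "-C", dest, "remote", "set-url", "origin", repo_url],  # point at upstream
        ["git", "-C", dest, "fetch", "origin"],                        # fetch the delta
    ]
```

A client would fetch `manifest_url(url)`, pass the response body to
`parse_manifest`, pick an entry with something like
`select_bundle(entries, {"region": "eu"})`, download that file (resuming
via range requests where the server supports them), and then run the
commands returned by `clone_commands`. If the manifest is missing or no
entry matches, it would simply fall back to a normal `git clone`.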

Prior art
~~~~~~~~~

Our proof-of-concept is built on top of ideas that have been circulating
for a while. We are aware of a number of proposed changes in this space:

* Jeff King's work on network bundles:
  https://github.com/peff/git/commit/17e2409df37edd0c49ef7d35f47a7695f9608900
* Nguyễn Thái Ngọc Duy's work on "[PATCH 0/8] Resumable clone revisited,
  proof of concept":
  https://www.spinics.net/lists/git/msg267260.html
* Resumable clone work by Kevin Wern:
  https://public-inbox.org/git/1473984742-12516-1-git-send-email-kevin.m.wern@xxxxxxxxx/

Whilst the proposals mentioned above are in a similar space, I would be
interested to understand whether there is any consensus on the general
idea of supporting static bundle files as a mechanism to seed a
repository. I would also appreciate any pointers to other discussions in
this area.

Best regards,
Stefan Saasen & Erik van Zijst; Atlassian Bitbucket