On Wed, Dec 4, 2013 at 12:08 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Thu, Nov 28, 2013 at 11:15:27AM -0800, Shawn Pearce wrote:
>
>> >> - better integration with git bundles, provide a way to seamlessly
>> >> create/fetch/resume the bundles with "git clone" and "git fetch"
>>
>> We have been thinking about formalizing the /clone.bundle hack used by
>> repo on Android. If the server has the bundle, add a capability in the
>> refs advertisement saying its available, and the clone client can
>> first fetch $URL/clone.bundle.
>
> Yes, that was going to be my next step after getting the bundle fetch
> support in.

Yay!

> If we are going to do this, though, I'd really love for it
> to not be "hey, fetch .../clone.bundle from me", but a full-fledged
> "here are full URLs of my mirrors".

Ack. I agree completely.

> Then you can redirect a non-http cloner to http to grab the bundle. Or
> redirect them to a CDN. Or even somebody else's server entirely (e.g.,
> "go fetch from Linus first, my piddly server cannot feed you the whole
> kernel"). Some of the redirects you can do by issuing an http redirect
> to "/clone.bundle", but the cross-protocol ones are tricky.

Ack. My thoughts exactly. Especially the part of "my piddly server
shouldn't have to serve you a clone of Linus' tree when there are many
public hosts mirroring his code available to anyone". It is simply not
fair to clone Linus' tree off some guy's home ADSL connection, his
uplink probably sucks. But it is reasonable to fetch his incremental
delta after cloning from some other well known and well connected
source.

> If we advertise it as a blob in a specialized ref (e.g., "refs/mirrors")
> it does not add much overhead over a simple capability. There are a few
> extra round trips to actually fetch the blob (client sends a want and no
> haves, then server sends the pack), but I think that's negligible when
> we are talking about redirecting a full clone.
> In either case, we have
> to hang up the original connection, fetch the mirror, and then come
> back.

I wasn't thinking about using a "well known blob" for this.

Jonathan, Dave, Colby and I were kicking this idea around on Monday
during lunch. If the initial ref advertisement included a "mirrors"
capability the client could respond with "want mirrors" instead of the
usual want/have negotiation. The server could then return the mirror
URLs as pkt-lines, one per pkt. It's one extra RTT, but this is trivial
compared to the cost to really clone the repository.

These pkt-lines need to be a bit more than just a URL. Or we need a new
URL like "bundle:http://...." to denote a resumable bundle over HTTP
vs. a normal HTTP URL that might not be a bundle file, and is just a
better connected server.

The mirror URLs could be stored in $GIT_DIR/config as a simple
multi-value variable. Unfortunately that isn't easily remotely
editable. But I am not sure I care?

GitHub doesn't let you edit $GIT_DIR/config, but it doesn't need to.
For most repositories hosted at GitHub, GitHub is probably the best
connected server for that repository. For repositories that are
incredibly high traffic GitHub might out of its own interest want to
configure mirror URLs on some sort of CDN to distribute the network
traffic closer to the edges. Repository owners just shouldn't have to
worry about these sorts of details. It should be managed by the
hosting service.

In my case for android.googlesource.com we want bundles on the CDN
near the network edges, and our repository owners don't care to know
the details of that. They just want our server software to make it all
happen, and our servers already manage $GIT_DIR/config for them. It
also mostly manages /clone.bundle on the CDN. And /clone.bundle is an
ugly, limited hack.

For the average home user sharing their working repository over git://
from their home ADSL or cable connection, editing .git/config is
easier than a blob in refs/mirrors.
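
A minimal sketch of what a reply to "want mirrors" might look like on
the wire, assuming each pkt-line carries a hypothetical "<kind> <url>"
pair ("bundle" for a resumable bundle file, "git" for a better
connected server). Only the pkt-line framing (4 hex digits of total
length, then the payload, with "0000" as the flush-pkt) is existing git
protocol; the kind tags, the function names, and the example.com URLs
are made up for illustration.

```python
FLUSH_PKT = b"0000"        # flush-pkt: marks the end of the list
MAX_PKT_LEN = 65520        # maximum total pkt-line length, prefix included


def pkt_line(payload: bytes) -> bytes:
    """Frame one payload as a pkt-line: 4 hex digits of length, then data."""
    length = len(payload) + 4  # the prefix counts its own 4 bytes
    if length > MAX_PKT_LEN:
        raise ValueError("payload too long for a single pkt-line")
    return b"%04x" % length + payload


def encode_mirror_list(mirrors):
    """Encode (kind, url) pairs, one per pkt-line, followed by a flush-pkt."""
    out = bytearray()
    for kind, url in mirrors:
        # Hypothetical line format: "<kind> <url>\n"
        out += pkt_line(f"{kind} {url}\n".encode("utf-8"))
    out += FLUSH_PKT
    return bytes(out)


def decode_mirror_list(data: bytes):
    """Parse the bytes produced by encode_mirror_list back into pairs."""
    mirrors, pos = [], 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:  # flush-pkt ends the list
            break
        line = data[pos + 4:pos + length].decode("utf-8").rstrip("\n")
        kind, url = line.split(" ", 1)
        mirrors.append((kind, url))
        pos += length
    return mirrors


if __name__ == "__main__":
    reply = encode_mirror_list([
        ("bundle", "http://cdn.example.com/project-20131204.bundle"),
        ("git", "git://mirror.example.com/project.git"),
    ])
    print(decode_mirror_list(reply))
```

Tagging each line with a kind is just one way to tell a resumable
bundle apart from an ordinary mirror; a "bundle:http://..." URL scheme
would carry the same information.
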
They already know how to edit .git/config to manage remotes. Heck,
remote.origin.url might already be a good mirror address to advertise,
especially if the client isn't on the same /24 as the server and the
remote.origin.url is something like "git.kernel.org". :-)

>> For most Git repositories the bundle can be constructed by saving the
>> bundle reference header into a file, e.g.
>> $GIT_DIR/objects/pack/pack-$NAME.bh at the same time the pack is
>> created. The bundle can be served by combining the .bh and .pack
>> streams onto the network. It is very little additional disk overhead
>> for the origin server,
>
> That's clever. It does not work out of the box if you are using
> alternates, but I think it could be adapted in certain situations. E.g.,
> if you layer the pack so that one "base" repo always has its full pack
> at the start, which is something we're already doing at GitHub.

Yes, well, I was assuming the pack was a fully connected repack.
Alternates always creates a partial pack. But if you have an
alternate, that alternate maybe should be given as a mirror URL? And
allow the client to recurse the alternate mirror URL list too?

By listing the alternate as a mirror a client could maybe discover the
resumable clone bundle in the alternate, grab that first to bootstrap,
reducing the amount it has to obtain in a non-resumable way. Or... the
descendant repository could offer its own bundle with the "must have"
assertions from the alternate at the time it repacked. So the .bh file
would have a number of ^ lines and the bundle was built with a "--not
..." list.

>> but allows resumable clone, provided the server has not done a GC.
>
> As an aside, the current transfer-resuming code in http.c is
> questionable. It does not use etags or any sort of invalidation
> mechanism, but just assumes hitting the same URL will give the same
> bytes.

Yea, our lunch conversation eventually reached this part too.
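
For illustration only: one standard HTTP way to make a resume safe is a
conditional range request, where the client sends the ETag it saw
earlier in If-Range together with Range. A server that honors this
answers 206 with the remainder only if the resource is unchanged, and
200 with the full body otherwise. The helper names below are
hypothetical, and this is a sketch of the general technique, not what
http.c or repo do.

```python
def resume_headers(bytes_on_disk, etag):
    """Request headers for resuming a download only if the file is unchanged.

    Returns an empty dict when a safe resume is not possible, meaning the
    caller should simply fetch the whole file again.
    """
    if bytes_on_disk <= 0 or not etag:
        return {}  # nothing to resume, or no validator was saved
    if etag.startswith("W/"):
        return {}  # weak ETags must not be used with If-Range
    return {
        "Range": f"bytes={bytes_on_disk}-",  # ask for the missing tail
        "If-Range": etag,                    # but only if still the same file
    }


def plan_write(status, bytes_on_disk):
    """Decide how to store the response body: (file mode, starting offset)."""
    if status == 206:
        # Partial Content: the validator matched, append the remainder.
        return ("ab", bytes_on_disk)
    if status == 200:
        # Full body: the file changed (or ranges are unsupported), start over.
        return ("wb", 0)
    raise RuntimeError(f"unexpected HTTP status {status}")


if __name__ == "__main__":
    print(resume_headers(10_000_000, '"20131204-abc"'))
    print(plan_write(206, 10_000_000))
    print(plan_write(200, 10_000_000))
```

With this shape, a mismatch is detected on the first response instead
of after the whole file has been downloaded.
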
repo's /clone.bundle hack is equally stupid and assumes a resume will
get the correct data, with no validation. If you resume with the wrong
data while inside of the pack stream the pack will be invalid; the
SHA-1 trailer won't match. But you won't know until you have
downloaded the entire useless file. Resuming a 700M download after the
first 10M only to find out the first 10M is mismatched sucks.

What really got us worried was the bundle header has no checksums, and
a resume in the bundle header from the wrong version could be
interesting.

> That _usually_ works for dumb fetching of objects and packfiles,
> though it is possible for a pack to change representation without
> changing name.

Yes. And this is why the packfile name algorithm is horribly flawed. I
keep saying we should change it to name the pack using the last 20
bytes of the file but ... nobody has written the patch for that? :-)

> My bundle patches inherited the same flaw, but it is much worse there,
> because your URL may very well just be "clone.bundle" that gets updated
> periodically.

Yup, you followed the same thing we did in repo, which is horribly
wrong.

We should try to use ETag if available to safely resume, and we should
try to encourage people to use stronger names when pointing to URLs
that are resumable, like a bundle on a CDN. If the URL is offered by
the server in pkt-lines after the advertisement it's easy for the
server to return the current CDN URL, and easy for the server to
implement enforcement of the URLs being unique. Especially if you
manage the CDN automatically; e.g. Android uses tools to build the CDN
files and push them out. It's easy for us to ensure these have unique
URLs on every push. A bundling server bundling once a day or once a
week could simply date stamp each run.

>> > I posted patches for this last year. One of the things that I got hung
>> > up on was that I spooled the bundle to disk, and then cloned from it.
>> > Which meant that you needed twice the disk space for a moment.
>>
>> I don't think this is a huge concern. In many cases the checked out
>> copy of the repository approaches a sizable fraction of the .pack
>> itself. If you don't have 2x .pack disk available at clone time you
>> may be in trouble anyway as you try to work with the repository post
>> clone.
>
> Yeah, in retrospect I was being stupid to let that hold it up. I'll
> revisit the patches (I've rebased them forward over the past year, so it
> shouldn't be too bad).

I keep prodding Jonathan to work on this too, because I'd really like
to get this out of repo and just have it be something git knows how to
do. And bigger mirrors like git.kernel.org could do a quick
grep/sort/uniq -c through their access logs and periodically bundle up
a few repositories that are cloned often. E.g. we all know
git.kernel.org should just bundle Linus' repository.
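
The grep/sort/uniq -c pass over access logs mentioned above could look
roughly like the sketch below. It assumes a web server access log in
the common "combined" style and counts smart-HTTP ref discovery
requests (GET .../info/refs?service=git-upload-pack) per repository
path. That request starts both clones and fetches, so the counts are
only a proxy for "cloned often", and git:// traffic would not show up
here at all. The function name and the sample log lines are
illustrative, not taken from any real server.

```python
import re
from collections import Counter

# Matches the request part of a log line and captures the repository path.
REF_DISCOVERY = re.compile(
    r'"GET (/\S+?)/info/refs\?service=git-upload-pack'
)


def most_fetched(log_lines, top=3):
    """Return the `top` repository paths with the most ref discovery hits."""
    counts = Counter()
    for line in log_lines:
        match = REF_DISCOVERY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [04/Dec/2013:12:00:00 +0000] '
        '"GET /a.git/info/refs?service=git-upload-pack HTTP/1.1" 200 512',
        '10.0.0.2 - - [04/Dec/2013:12:00:05 +0000] '
        '"GET /b.git/info/refs?service=git-upload-pack HTTP/1.1" 200 512',
        '10.0.0.3 - - [04/Dec/2013:12:00:09 +0000] '
        '"GET /a.git/info/refs?service=git-upload-pack HTTP/1.1" 200 512',
        '10.0.0.4 - - [04/Dec/2013:12:00:11 +0000] '
        '"GET /index.html HTTP/1.1" 200 1024',
    ]
    for repo, hits in most_fetched(sample):
        print(hits, repo)
```

The repositories at the top of that list would be the candidates for a
periodic bundle run.
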