On Wed, Nov 2, 2011 at 16:27, Jeff King <peff@xxxxxxxx> wrote:
> On Wed, Nov 02, 2011 at 03:41:36PM -0700, Junio C Hamano wrote:
>> Jeff King <peff@xxxxxxxx> writes:
>>
>> > Which is all a roundabout way of saying that the git protocol is really
>> > the sane way to do efficient transfers. An alternative, much simpler
>> > scheme would be for the server to just say:
>> >
>> >   - if you have nothing, then prime with URL http://host/bundle
>> >
>> > And then _only_ clone would bother with checking mirrors. People doing
>> > fetch would be expected to do it often enough that not being resumable
>> > isn't a big deal.
>>
>> I think that is a sensible place to start.

Yup, I agree. The "repo" tool used by Android does this in Python right
now [1]. It's a simple hack: if the protocol is HTTP or HTTPS, the
client first tries to download $URL/clone.bundle. My servers have rules
that trap on */clone.bundle and issue an HTTP 302 Found response to
redirect the client to a CDN. Works. :-)

[1] http://code.google.com/p/git-repo/source/detail?r=f322b9abb4cadc67b991baf6ba1b9f2fbd5d7812&name=stable

> OK. That had been my original intent, but somebody (you?) mentioned the
> "if you have X" thing at the GitTogether, which got me thinking.
>
> I don't mind starting slow, as long as we don't paint ourselves into a
> corner for future expansion. I'll try to design the data format for
> specifying the mirror locations with that extension in mind.

Right. Aside from the fact that $URL/clone.bundle is perhaps a bad way
to decide on the URL to actually fetch (and isn't supportable over
git:// or ssh://)... we should start with the clone case and worry
about incremental updates later.

> Even if the bundle thing ends up too wasteful, it may still be useful to
> offer a "if you don't have X, go see Y" type of mirror when "Y" is
> something efficient, like git:// at a faster host (i.e., the "I built 3
> commits on top of Linus" case).

Actually, I really think the bundle thing is wasteful. It's a ton of
additional disk. Hosts like kernel.org want to use sendfile() when
possible to handle bulk transfers; git:// is not efficient for them
because we don't have sendfile() capability. It's also expensive for
kernel.org to store each Git repository twice on disk. The disk itself
is cheap; it's the kernel buffer cache that is damned expensive.

Assume for a minute that Linus' kernel repository is a popular thing to
access. If 400M of that history is available in a normal pack file on
disk, and the same 400M is available again as a "clone bundle thingy",
kernel.org now has to eat 800M of disk buffer cache for that one Git
repository, because both of those files are going to be hot.

I think I messed up by having "repo" use a Git bundle file as its data
source. What we should have used was a bog-standard pack file. Then the
client can download the pack file into the .git/objects/pack directory
and just generate the index, reusing the entire dumb protocol transport
logic. It also lets the server hand out the same file it retains for
the repository itself, keeping the disk buffer cache at only 400M for
Linus' repository.

> Agreed. I was really trying to avoid protocol extensions, though, at
> least for an initial version. I'd like to see how far we can get doing
> the simplest thing.

One (maybe dumb) idea I had was making the $GIT_DIR/objects/info/packs
file contain additional lines listing the reference tips at the time
each pack was made. The client just needs the SHA-1s; it doesn't
necessarily need the branch names themselves.
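For example, the extension could be as small as a new line type next to
the existing "P" lines. The "T" lines and the SHA-1s below are invented
purely for illustration; only the "P" line exists in the current
format:

  P pack-6e8e3f04f38c4f4d0b1f9f950b21a10b78bd4db6.pack
  T 8b3f2a9c6d41e07a5c9e2f0d4b6a8c1e3f5a7b9d
  T 1c0d2e4f6a8b9c1d3e5f7a9b0c2d4e6f8a0b1c2d

A client that understands the new lines can build its dummy references
from them; one that doesn't would just skip unknown line types, so
nothing needs to break for older clients.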
A client could initialize itself by getting this set of references,
creating temporary dummy references at those SHA-1s, downloading the
corresponding pack file and indexing it, then resuming with a normal
fetch (a rough command-level sketch appears at the end of this note).
Then we wind up with a git:// or ssh:// protocol extension that enables
sendfile() on an entire pack, and that provides the matching
objects/info/packs data to help a client over git:// or ssh://
initialize off the existing pack files.

Obviously there is the existing security feature that over git:// or
ssh:// (or even smart HTTP), a deleted or rewound reference stops
exposing the content in the repository that isn't reachable from the
other reference tips. The repository owner / server administrator will
have to make a choice here: either the existing packs are not exposed
as available via sendfile() until after GC has rebuilt them around the
right content set, or they are exposed and the time to expunge/hide an
unreferenced object stretches out until the GC completes (rather than
being immediate after the reference updates).

But either way, I like the idea of coupling the "resumable pack
download" to the *existing* pack files, because this is easy to deal
with. If you do have a rewind/delete and need to expunge content,
users/administrators already know how to run `git gc --prune=now` to
accomplish a full erase. Adding another thing with bundle files
somewhere else that may or may not contain the data you want to erase,
and remembering to clean that up, is not a good idea.
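For concreteness, the client bootstrap described above could look
roughly like this with today's plumbing. The URL, pack name ($P), and
tip SHA-1 ($TIP) are placeholders, and the tip lines in
objects/info/packs are the hypothetical extension sketched earlier:

  git init linux && cd linux
  # Learn the pack name and the advertised tips:
  curl $URL/objects/info/packs
  # Resumable bulk download straight into the object store:
  curl -C - -o .git/objects/pack/pack-$P.pack \
      $URL/objects/pack/pack-$P.pack
  git index-pack .git/objects/pack/pack-$P.pack
  # One dummy ref per advertised tip keeps the history reachable:
  git update-ref refs/dummy/1 $TIP
  # Catch up on whatever happened after the pack was made:
  git fetch $URL '+refs/heads/*:refs/remotes/origin/*'
  git update-ref -d refs/dummy/1

The nice property is that only the initial pack transfer needs to be
resumable; once the pack is indexed, everything else is the normal
fetch machinery.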