On Wed, Nov 2, 2011 at 19:42, Jeff King <peff@xxxxxxxx> wrote:
> On Wed, Nov 02, 2011 at 05:06:53PM -0700, Shawn O. Pearce wrote:
>
> I thought of doing something like that, but I wanted to be able to make
> cross-domain links. The "302 to a CDN" thing is a clever hack, but it
> requires more control of the webserver than some users might have. And
> of course it doesn't work for the "redirect to git:// on a different
> server" trick. Or redirect from "git://".

I agree. Later I said I regret this being a bundle file. I also regret
it being this $URL/clone.bundle thing. It's a reasonable quick hack in
Python for repo. It's cheap for my servers to respond 404 Not Found or
302 Found, and cheap to use the CDN. But it isn't the right solution
for git-core.

It has given us some useful information already in the context of
android.googlesource.com. It appears to work quite well for
distributing the large Android operating system. So the notion of
making packs available from another URL than the main repository, doing
it primarily as a pack rather than the native Git protocol, and
following up with an incremental fetch to bring the client current
seems to work. :-)

> My thought of having it in "refs/mirrors" is only slightly less hacky,
> but I think covers all of those cases. :)

Right, but this would have been a bit more work for me to code in
Python. :-) Long term this may be a better approach, because it allows
the user to control the redirect without having full control over their
HTTP server. It also supports redirections across protocols, as you
noted above. So it's probably the direction we will see git-core take.

>> Actually, I really think the bundle thing is wasteful.... sendfile()
>> capability.
>
> I didn't quite parse this. You say it is wasteful, but then indicate
> that it can use sendfile(), which is a good thing.

Apparently I was babbling. Based on what else you say, we agree. That
is good enough for me.

> However, I do agree with this:
>
>> Its also expensive for kernel.org to create each Git repository twice
>> on disk. The disk is cheap. Its the kernel buffer cache that is damned
>> expensive.
>
> Doubling the disk cache required is evil and ugly. I was hoping it
> wouldn't matter because the bundle would be hosted on some far-away CDN
> server anyway, though. But that is highly dependent on your setup. And
> it's really just glossing over the fact that you have twice as many
> servers. ;)

Right. :-) In my opinion this is the important part. We shouldn't
double the disk usage required to support this. Most users can't afford
the extra disk cache or the extra server required to make this work
well. But they can use sendfile() on the server they have and get a lot
of improvement in clone speed due to lower system load, plus resumable
clone for the relatively stable history part.
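To make the "use sendfile() on the server they have" part concrete,
here is a rough sketch in Python (nothing like git-core's or repo's
actual code; the pack path and port are invented) of serving the stable
history pack straight off disk, with just enough Range handling that an
interrupted download can resume:

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PACK_PATH = "objects/pack/pack-stable-history.pack"  # hypothetical name

    class PackHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            size = os.path.getsize(PACK_PATH)
            start = 0
            rng = self.headers.get("Range")
            if rng and rng.startswith("bytes="):
                # Only the open-ended "bytes=N-" form a resuming client sends.
                start = int(rng[6:].rstrip("-"))
                self.send_response(206)
                self.send_header("Content-Range",
                                 "bytes %d-%d/%d" % (start, size - 1, size))
            else:
                self.send_response(200)
            self.send_header("Content-Length", str(size - start))
            self.end_headers()
            with open(PACK_PATH, "rb") as f:
                # The kernel copies file pages straight to the socket; the
                # server never looks at the data, the client validates it.
                offset, remaining = start, size - start
                while remaining > 0:
                    sent = os.sendfile(self.wfile.fileno(), f.fileno(),
                                       offset, remaining)
                    offset += sent
                    remaining -= sent

    if __name__ == "__main__":
        HTTPServer(("", 8000), PackHandler).serve_forever()

Whether it is sendfile() or a plain read-write loop barely matters; the
point is the server spends almost nothing on the stable part of the
history.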
> Another issue with packs is that they generally aren't supposed to be
> --thin on disk, whereas bundles can be. So I could point you to a
> succession of bundles. Which is maybe a feature, or maybe just makes
> things insanely complex[1].

Actually, we can store --thin on disk safely. Don't laugh until you
finish reading it through.

To build an incremental pack we modify pack-objects to construct a
completed thin pack on disk. Build up the list of objects that you want
in the thin pack, as though it were thin. Use REF_DELTA format to
reference objects that are not in this set but are delta bases. Copy
the necessary delta bases from the base pack over to the tail of the
thin pack, just like it would be if the pack had been received over the
wire. The pack is now self-contained like it's supposed to be, but the
tail of it is redundant information.

If you cache alongside of the pack the "thin" object count, the cut
offset of the thin vs. completed bases, and the SHA-1 of the "thin"
pack, you can serve the "thin" pack by copying the header, then the
region of the file up to the cut point, and the final SHA-1. And there
are no pack file format changes involved. :-)

Obviously this has some downside. Using REF_DELTA instead of OFS_DELTA
for the relatively small number of references from the "thin" part to
the completed part at the tail isn't a big disk space overhead. The big
overhead is storing the boundary data that served as delta bases at the
tail of this incremental pack. But we already do that when you transfer
this section of data over the network and it was more than 100 objects.
So I think we can get away with doing this.

The serving repository is in no worse state than if the owner had just
pushed all of that incremental stuff into the serving repository and it
completed as a thin pack. With only 2 packs in the serving repository
(e.g. the historical stuff that is stable, and the incremental current
thin pack + completed bases), git gc --auto wouldn't even kick in to GC
this thing for a while *anyway*. So we probably already have a ton of
repositories in the wild that exhibit this disk layout and space usage,
and nobody has complained about it.

For a server admin or repository owner who cares about his users'
resumable clone support, carrying around a historical pack and a single
new incremental pack for say 2-3 months before repacking the entire
thing down to 1 new historical pack... the disk space and additional
completed base data is an acceptable cost. We already do it.

Clients can figure out whether or not they should use an incremental
pack download vs. the native Git protocol if the incremental pack, like
a bundle, stores the base information alongside of it. Actually you
don't want the bases (the ^ lines in a bundle), but the immediate
children of those. If the client has any of those children, there is
some chance the client has other objects in the pack and should favor
the native protocol. But if the client has none of those base children,
yet does have the bases, it may be more efficient to download the
entire pack to bring the client current.

The problem with incremental pack updates is balancing the number of
round-trip requests against the update rate of the repository against
the polling frequency of the client. It's not an easy thing to solve.
However, we may be able to do better if the server can do a reasonably
fast concat of these thin pack slices together, writing a new object
header and computing the SHA-1 trailer as it goes. Instead of computing
actual graph connectivity, just concat packs together between the base
children and the requested tips. This probably requires that the client
ask for every branch (e.g. the typical
refs/heads/*:refs/remotes/origin/* refspec) and that branches didn't
rewind. But I think this is so common it's perhaps worthwhile to look
into optimizing.

But note we can do this in the native protocol at the server side
without telling the client anything, or changing the protocol. It just
isn't resumable without a bit more glue to have a state marker
available to the client. Nor does it work on a CDN without giving the
client more information. :-)
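To illustrate the byte shuffling behind "copy the header, the region up
to the cut point, and the final SHA-1" described above, here is a rough
sketch. It assumes the three cached values are available; the function
and argument names are made up and this is not git-core code:

    import struct

    def serve_thin_view(pack_path, out, thin_count, cut_offset, thin_sha1):
        """Stream the "thin" view of a completed thin pack stored on disk.

        pack_path:  the completed thin pack (thin objects + appended bases)
        thin_count: cached number of objects in the thin part
        cut_offset: cached byte offset where the appended bases begin
        thin_sha1:  cached 20-byte SHA-1 trailer of the original thin pack
        """
        with open(pack_path, "rb") as f:
            magic, version, _completed_count = struct.unpack(">4sLL", f.read(12))
            assert magic == b"PACK" and version == 2
            # Header rewritten with the thin object count, not the completed one.
            out.write(struct.pack(">4sLL", b"PACK", version, thin_count))
            # Object data of the thin part, byte for byte from the completed pack.
            remaining = cut_offset - 12
            while remaining > 0:
                chunk = f.read(min(65536, remaining))
                out.write(chunk)
                remaining -= len(chunk)
            # Cached trailer of the thin pack; the redundant bases after
            # cut_offset are simply never sent.
            out.write(thin_sha1)

On the receiving side something like "git index-pack --stdin --fix-thin"
would complete the thin pack again locally, just as it does for a
stream fetched over the native protocol.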
> So the sendfile() stuff would always happen over http.

I'm OK with that. I was just saying we may be able to also support
sendfile() over git:// if the repository owner / git-daemon owner wants
us to. Or if not sendfile(), a simple read-write loop that doesn't have
to look at the data, since the client will validate it all.

> Yeah, I'm liking that idea. In reference to my [1] above, what I've
> started with is making:
>
>   git fetch http://host/foo.bundle

This should work, whether or not we use it for resumable clone. It's
just nice to have that tiny bit of extra glue to make it easy to pull a
bundle. So I'd like this too. :-)

> Pulling down a non-thin packfile makes the problem go away. We can spool
> it right into objects/pack, index it on the fly, and if all is well,
> move it into its final filename. If the transfer is interrupted, you
> drop what's been indexed so far, finish the transfer, and then re-start
> the indexing from scratch (actually, the "on the fly" would probably
> involve teaching index-pack to be clever about incrementally reading a
> partially written file, but it should be possible).

I wonder if we can teach index-pack to work with a thin pack on disk
and complete it by appending to the file, in addition to the streaming
from stdin it already supports. Seems like that should be possible.
Then you could save a thin pack to a temp file on disk, and thus could
split a bundle header from its pack content, saving them into two
different temp files, allowing index-pack to avoid copying the pack
portion if it's non-thin, or if it's a huge thin pack.

I did think about doing this in "repo" and decided it was complex, and
not worth the effort. So we spool 2G+ bundles. It's not the most
pleasant user experience. If I had more time, I would have tried to
split the bundle header from the pack and written the pack directly off
for index-pack to read from disk.
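For what it's worth, the split itself is simple. Here is a rough sketch
(hypothetical file names, not what repo actually does) of separating a
v2 bundle's header from its pack so the pack half could be handed to
index-pack from disk:

    def split_bundle(bundle_path, header_path, pack_path):
        with open(bundle_path, "rb") as src, \
             open(header_path, "wb") as hdr, \
             open(pack_path, "wb") as pack:
            signature = src.readline()
            assert signature == b"# v2 git bundle\n"
            hdr.write(signature)
            # Prerequisite ("-<sha1>") and ref lines, terminated by a blank line.
            while True:
                line = src.readline()
                hdr.write(line)
                if line == b"\n":
                    break
            # Everything after the blank line is the pack stream.
            while True:
                chunk = src.read(65536)
                if not chunk:
                    break
                pack.write(chunk)

    split_bundle("clone.bundle", "clone.refs", "clone.pack")

If the pack turned out to be non-thin, something like
"git index-pack clone.pack" could index it in place; the thin case is
exactly the part that would need the new index-pack support described
above.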