On Fri, 28 Jan 2011, Shawn Pearce wrote:

> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >
> >> This started because I was looking for a way to speed up clones coming
> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
...
> >> Later I realized, we can get rid of that cached list of objects and
> >> just use the pack itself.
...
> > Playing my old record again... I know.  But pack v4 should solve a big
> > part of this enumeration cost.
>
> I've said the same thing for years myself.  As much as it would be
> nice to fix some of the decompression costs with pack v2/v3, v2/v3 is
> very common in the wild, and a new pack encoding is going to be a
> fairly complex thing to get added to C Git.  And pack v4 doesn't
> eliminate the enumeration, it just makes it faster.

Well, you don't necessarily need pack v4 to be widely deployed for
people to benefit from it.  If it is available on servers such as
git.kernel.org, then everybody will see their clone requests go faster.
Same principle as for the cache packs.

And no, it doesn't eliminate the enumeration, but you can't eliminate
enumeration entirely anyway, as many other operations require it too,
and those would be sped up as well.  This is indeed orthogonal to the
cache pack concept, though.

> That's what I also liked about my --create-cache flag.  It's keeping
> the same data we already have, in the same format we already have it
> in.  We're just making a more explicit statement that everything in
> some pack is about as tightly compressed as it ever would be for a
> client, and it isn't going to change anytime soon.  Thus we might as
> well tag it with .keep to prevent repack from mucking with it, and we
> can take advantage of this to serve the pack to clients very fast.

I do agree on that point, and I like it too.  However, I'd prefer if the
whole thing wasn't created "automatically".
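For the record, the .keep convention Shawn alludes to already exists: repack
leaves pack-<name>.pack alone whenever a sibling pack-<name>.keep file is
present.  A minimal sketch of that marker mechanism (plain file operations
only, not Git's implementation; the pack name below is made up for
illustration):

```python
import os
import tempfile
from pathlib import Path

def mark_pack_kept(pack_path: str, reason: str = "cache pack") -> Path:
    """Create the .keep marker next to a pack file and return its path."""
    keep = Path(pack_path).with_suffix(".keep")
    # The file's content is free-form; its mere presence is what
    # protects the pack from being rewritten by repack.
    keep.write_text(reason + "\n")
    return keep

# Demo with a stand-in file in a throwaway directory.
pack_dir = tempfile.mkdtemp()
pack = os.path.join(pack_dir, "pack-1234abcd.pack")
open(pack, "wb").close()        # placeholder for a real pack file
marker = mark_pack_kept(pack)
print(marker.name)              # pack-1234abcd.keep
```

Since the content of the marker file is otherwise ignored, it is also a
natural place to record extra metadata, which is where my suggestion below
about listing commit tips comes in.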
It's probably best if the repository administrator explicitly decides what
should go into such cache packs, based on the repository's actual purpose
and usage, in order to pick good commit thresholds and branches.  Only a
human can make that decision.

I'd also recommend _not_ using the ref namespace for that.  Let's not mix
up branching/tagging with what is effectively a storage implementation
issue.  Linking the ref namespace with the actual packs they refer to
would be highly inelegant if the SHA1 of the pack has to be part of the
ref name.  Instead, I'd suggest simply listing all the commit tips a cache
pack contains in the .keep file directly.  That would also make it much
easier to use with object alternates, as the alternate mechanism points to
the object store of a foreign repo and not to its refs.

> Over breakfast this morning I made the point to Junio that with the
> cached pack and a slight network protocol change (enabled by a
> capability of course) we could stop using pkt-line framing when
> sending the cached pack part of the stream, and just send the pack
> directly down the socket.  That changes the clone of a 400 MB project
> like linux-2.6 from being a lot of user space stuff, to just being a
> sendfile() call for the bulk of the content.  I think we can just hand
> off the major streaming to the kernel.

While this might look like a good idea in theory, did you actually
profile it to see if it would make a noticeable difference?  The pkt-line
framing allows asynchronous messages to be sent over a sideband, which
you would no longer be able to do until the full 400 MB has been received
by the remote side.  Without concrete performance numbers, I'm not
convinced such a deviation in the protocol is worth the maintenance cost.

> (Part of the protocol change
> is we would need to use multiple SHA-1 checksums in the stream, so we
> don't have to re-checksum the existing cached pack.)

??  I don't follow you here.
> I love the idea of some of the concepts in pack v4.  I really do.  But
> this sounds a lot simpler to implement, and it lets us completely
> eliminate a massive amount of server processing (even under pack v4
> you still have object enumeration), in exchange for what might be a
> few extra MBs on the wire to the client due to slightly less good
> deltas and the use of REF_DELTA in the thin pack used for the most
> recent objects.

I agree.  And what I personally like the most is the fact that this can
be made transparent to clients using the existing network protocol
unchanged.

> Plus we can safely do byte range requests for resumable clone within
> the cached pack part of the stream.

That part I'm not so sure of.  We are still facing the same old issues
here, as some mirrors might have the same commit edges for a cache pack
but not necessarily the same packing result, etc.  So I'd keep that out
of the picture for now.  The idea of being able to resume the transfer
of a cache pack is good; however, I'd make it a totally separate service
outside git-upload-pack, where the issue of validating and updating
content on both sides can be handled efficiently without impacting the
upload-pack protocol.  There would also be more than just the cache pack
in play during a typical clone.

> And when pack v4 comes along, we
> can use this same strategy for an equally large pack v4 pack.

Absolutely.


Nicolas