On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
> On Fri, 28 Jan 2011, Shawn Pearce wrote:
>
>> This started because I was looking for a way to speed up clones coming
>> from a JGit server. Cloning the linux-2.6 repository is painful, ...
>> Later I realized, we can get rid of that cached list of objects and
>> just use the pack itself. ...
>
> Playing my old record again... I know. But pack v4 should solve a big
> part of this enumeration cost.

I've said the same thing for years myself. As much as it would be nice to fix some of the decompression costs of pack v2/v3, v2/v3 is very common in the wild, and a new pack encoding would be a fairly complex thing to get added to C Git. And pack v4 doesn't eliminate the enumeration; it only makes it faster.

> So that's the idea. Keep the exact same functionality as we have now,
> without any need for cache management, but making the data structure in
> a form that should improve object enumeration by some magnitude.

That's what I also liked about my --create-cache flag. It's keeping the same data we already have, in the same format we already have it in. We're just making a more explicit statement that everything in some pack is about as tightly compressed as it ever will be for a client, and isn't going to change anytime soon. Thus we might as well tag it with .keep to prevent repack from mucking with it, and we can take advantage of this to serve the pack to clients very fast.

Over breakfast this morning I made the point to Junio that with the cached pack and a slight network protocol change (enabled by a capability, of course) we could stop using pkt-line framing when sending the cached pack part of the stream, and just send the pack directly down the socket. That changes the clone of a 400 MB project like linux-2.6 from being a lot of user-space work to just being a sendfile() call for the bulk of the content. I think we can hand off the major streaming to the kernel.
(Part of the protocol change is that we would need to use multiple SHA-1 checksums in the stream, so we don't have to re-checksum the existing cached pack.)

I love some of the concepts in pack v4. I really do. But this sounds a lot simpler to implement, and it lets us completely eliminate a massive amount of server processing (even under pack v4 you still have object enumeration), in exchange for what might be a few extra MBs on the wire to the client, due to slightly less good deltas and the use of REF_DELTA in the thin pack that carries the most recent objects.

I don't envision this being used on projects smaller than git.git itself; if you can gc --aggressive the whole thing in a minute, the cached pack is probably pointless. But if you have 400+ MB, you want the transfer to be network bound, with almost no CPU impact on the server. Plus we can safely do byte-range requests for resumable clone within the cached pack part of the stream.

And when pack v4 comes along, we can use this same strategy for an equally large pack v4 pack.

-- 
Shawn.
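For anyone not steeped in the wire protocol: the framing we would be skipping for the cached pack is pkt-line, where every chunk carries a 4-digit hex length prefix that counts itself. A minimal sketch of the encoder (mine, for illustration):

```python
# Minimal sketch of Git's pkt-line framing. Each chunk is prefixed with
# a 4-hex-digit length that includes the 4 prefix bytes themselves;
# "0000" is the flush packet that carries no payload.
def pkt_line(payload: bytes) -> bytes:
    assert len(payload) + 4 <= 0xFFFF  # pkt-line length field is 16-bit hex
    return b"%04x" % (len(payload) + 4) + payload

FLUSH_PKT = b"0000"
```

With the capability negotiated, the server could keep emitting pkt-lines for the thin pack of recent objects, then drop to raw bytes for the cached pack portion so sendfile() can take over.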