On Fri, Jan 28, 2011 at 13:09, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
> On Fri, 28 Jan 2011, Shawn Pearce wrote:
>
>> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
>> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
>> >
>> >> This started because I was looking for a way to speed up clones coming
>> >> from a JGit server.  Cloning the linux-2.6 repository is painful,

Well, scratch the idea in this thread.  I think.

I retested JGit vs. C Git on an identical linux-2.6 repository.  The
repository was fully packed, but had two pack files, 362M and 57M.  It
was created by packing a one month old master, marking that pack .keep,
and then running repack -a -d to put the most recent month into a
second pack.  This leaves some files stored whole in the two packs that
should have been delta compressed together (obviously).

The two implementations take about the same amount of time to generate
the clone: 3m28s / 3m22s for JGit vs. 3m23s for C Git.  The JGit
created pack is actually smaller, 376.30 MiB vs. C Git's 380.59 MiB.  I
point out this data because improvements made to JGit may translate
into similar improvements for C Git, given how close they are in
running time.

I fully implemented the "reuse a cached pack behind a thin pack" idea I
was trying to describe in this thread.  It saved 1m7s off the JGit
running time, but increased the data transfer by 25 MiB.  I didn't
expect this much of an increase; I honestly expected the thin pack
portion to be, well, thinner.

The issue is that the thin pack cannot delta against all of the
history; it only delta compresses against the tip of the cached pack.
So long-lived side branches that forked off an older part of the
history aren't delta compressing well, or at all, and that is
significantly bloating the thin pack.  (It's also why that "newer" pack
is 57M when it should be 14M if correctly combined with the cached
pack.)  If I were to consider all of the objects in the cached pack as
potential delta base candidates for the thin pack, the entire benefit
of the cached pack disappears.

Which leaves me with dropping this idea.  I started it because I was
actually looking for a way to speed up JGit, but we're already roughly
on par with C Git performance.  Dropping 1m7s from a clone is great,
but not at the expense of a 6.5% larger network transfer.  For most
clients, 25 MiB of additional data transfer costs much more time than
the 1m7s saved in server-side computation.

>> That's what I also liked about my --create-cache flag.
>
> I do agree on that point.  And I like it too.

I'm not sure I like it so much anymore. :-)  The idea was half-baked,
and came at the end of a long day, after putting my cranky infant son
down to sleep way past his normal bed time.  I claim I was a sleep
deprived new parent who wasn't thinking things through enough before
writing an email to git@vger.

>> sendfile() call for the bulk of the content. I think we can just hand
>> off the major streaming to the kernel.
>
> While this might look like a good idea in theory, did you actually
> profile it to see if that would make a noticeable difference? The
> pkt-line framing allows for asynchronous messages to be sent over a
> sideband,

No, of course not.  The pkt-line framing is pretty low overhead, but
copying from a kernel buffer to userspace and back into a kernel buffer
sort of sucks for 400 MiB of data.  sendfile() on 400 MiB to a network
socket is much easier when it all stays in kernel space.
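
To illustrate what I mean by handing the streaming to the kernel, here
is a rough, untested sketch in plain POSIX/Linux C.  This is not git or
JGit code; stream_cached_pack, pack_fd and client_fd are made-up names.
It just ships an on-disk pack straight to the client socket with
sendfile(2):

	/*
	 * Sketch only: stream an on-disk pack directly to a connected
	 * client socket with sendfile(2), so the pack data never
	 * passes through a userspace buffer.  pack_fd and client_fd
	 * are assumed to be already open; the socket is blocking.
	 */
	#include <sys/sendfile.h>
	#include <sys/stat.h>
	#include <unistd.h>
	#include <errno.h>

	static int stream_cached_pack(int client_fd, int pack_fd)
	{
		struct stat st;
		off_t offset = 0;

		if (fstat(pack_fd, &st) < 0)
			return -1;

		while (offset < st.st_size) {
			ssize_t sent = sendfile(client_fd, pack_fd,
						&offset,
						st.st_size - offset);
			if (sent < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
		}
		return 0;
	}

None of the pack ever touches a userspace buffer; the obvious cost is
that nothing can be multiplexed onto a sideband while that loop is
running, which is exactly the trade-off you raised.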
I figured, if it already worked out to just dump the pack to the wire
as-is, then we should probably also try to go for broke and reduce the
userspace copying.  It might not matter on your desktop, but ask John
Hawley (CC'd) about kernel.org and the git traffic volume he is
serving.  They are doing more than 1 million git:// requests per day
now.

>> Plus we can safely do byte range requests for resumable clone within
>> the cached pack part of the stream.
>
> That part I'm not sure of. We are still facing the same old issues
> here, as some mirrors might have the same commit edges for a cache pack
> but not necessarily the same packing result, etc. So I'd keep that out
> of the picture for now.

I don't think it's that hard.  If we modify the transfer protocol to
let the server mark the boundaries between packs, the server can send
the pack name (as in pack-$name.pack) and the pack's SHA-1 trailer to
the client.

A client asking to resume a cached pack presents its original want
list, these two SHA-1s, and the byte offset it wants to restart from.
The server validates that the want set is still reachable, that the
cached pack exists, and that the cached pack tips are reachable from
the current refs.  If all of that is true, it checks that the trailing
SHA-1 in the pack matches what the client gave it.  If that matches, it
should be OK to resume the transfer from where the client asked.

Then it's up to the administrators of a round-robin serving cluster to
ensure that the same cached pack is available on all nodes, so that a
resuming client is likely to have its request succeed.  This isn't
impossible.  If the server operator cares, they can keep the prior
cached pack around for several weeks after creating a newer one, giving
clients plenty of time to resume a broken clone.  Disk is fairly
inexpensive these days.

But it's perhaps pointless, see above. :-)

-- 
Shawn.
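
P.S.  To make the resume check above concrete, here is a rough,
untested sketch of the server side once the reachability checks have
already passed.  Again this is plain POSIX C, not actual git or JGit
code; resume_cached_pack, pack_fd and client_fd are made-up names:

	/*
	 * Sketch only: given the pack trailer SHA-1 and byte offset a
	 * resuming client sent us, verify the on-disk cached pack
	 * still ends with the same trailer, then stream the remainder
	 * from that offset.  Reachability of the client's want set and
	 * of the cached pack tips is assumed to be checked elsewhere.
	 */
	#include <sys/sendfile.h>
	#include <sys/stat.h>
	#include <string.h>
	#include <unistd.h>

	#define TRAILER_LEN 20	/* SHA-1 at the end of every pack */

	static int resume_cached_pack(int client_fd, int pack_fd,
			const unsigned char client_trailer[TRAILER_LEN],
			off_t resume_offset)
	{
		struct stat st;
		unsigned char trailer[TRAILER_LEN];

		if (fstat(pack_fd, &st) < 0 || st.st_size < TRAILER_LEN)
			return -1;

		/* Compare the last 20 bytes of the pack with the
		 * trailer the client claims it was downloading. */
		if (pread(pack_fd, trailer, TRAILER_LEN,
			  st.st_size - TRAILER_LEN) != TRAILER_LEN)
			return -1;
		if (memcmp(trailer, client_trailer, TRAILER_LEN))
			return -1;	/* repacked; client must restart */

		if (resume_offset < 0 || resume_offset > st.st_size)
			return -1;

		/* Ship the rest of the pack from the client's offset. */
		while (resume_offset < st.st_size) {
			if (sendfile(client_fd, pack_fd, &resume_offset,
				     st.st_size - resume_offset) < 0)
				return -1;
		}
		return 0;
	}

The key property is that the trailing SHA-1 identifies the exact byte
stream, so a mismatch means the pack was rewritten and the client has
to start the clone over rather than resume.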