On Fri, 28 Jan 2011, Shawn Pearce wrote:
> On Fri, Jan 28, 2011 at 13:09, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >
> >> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
> >> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >> >
> >> >> This started because I was looking for a way to speed up clones coming
> >> >> from a JGit server. Cloning the linux-2.6 repository is painful,
>
> Well, scratch the idea in this thread. I think.
>
> I retested JGit vs. CGit on an identical linux-2.6 repository. The
> repository was fully packed, but had two pack files. 362M and 57M,
> and was created by packing a 1 month old master, marking it .keep, and
> then repacking -a -d to get most recent last month into another pack.
> This results in some files that should be delta compressed together
> being stored whole in the two packs (obviously).
>
> The two implementations take the same amount of time to generate the
> clone. 3m28s / 3m22s for JGit, 3m23s for C Git. The JGit created
> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB. I point
> out this data because improvements made to JGit may show similar
> improvements to CGit given how close they are in running time.

What are those improvements?

Now, the fact that JGit is so close to CGit must be because the actual
cost is outside of them, such as within zlib; otherwise the C code
should normally always be faster, right?

Looking at the profile for "git rev-list --objects --all > /dev/null"
for the object enumeration phase, we have:

# Samples: 1814637
#
# Overhead  Command  Shared Object          Symbol
# ........  .......  .....................  ......
#
    28.81%  git      /home/nico/bin/git     [.] lookup_object
    12.21%  git      /lib64/libz.so.1.2.3   [.] inflate
    10.49%  git      /lib64/libz.so.1.2.3   [.] inflate_fast
     7.47%  git      /lib64/libz.so.1.2.3   [.] inflate_table
     6.66%  git      /lib64/libc-2.11.2.so  [.] __GI_memcpy
     5.66%  git      /home/nico/bin/git     [.] find_pack_entry_one
     2.98%  git      /home/nico/bin/git     [.] decode_tree_entry
     2.73%  git      /lib64/libc-2.11.2.so  [.] _int_malloc
     2.71%  git      /lib64/libz.so.1.2.3   [.] adler32
     2.63%  git      /home/nico/bin/git     [.] process_tree
     1.58%  git      [kernel]               [k] 0xffffffff8112fc0c
     1.44%  git      /lib64/libc-2.11.2.so  [.] __strlen_sse2
     1.31%  git      /home/nico/bin/git     [.] tree_entry
     1.10%  git      /lib64/libc-2.11.2.so  [.] _int_free
     0.96%  git      /home/nico/bin/git     [.] patch_delta
     0.92%  git      /lib64/libc-2.11.2.so  [.] malloc_consolidate
     0.86%  git      /lib64/libc-2.11.2.so  [.] __GI_vfprintf
     0.80%  git      /home/nico/bin/git     [.] create_object
     0.80%  git      /home/nico/bin/git     [.] lookup_blob
     0.63%  git      /home/nico/bin/git     [.] update_tree_entry
[...]

So we've got lookup_object() clearly at the top. I suspect the
hashcmp() in there, which probably gets inlined, is responsible for
most of those cycles. There is certainly a better way here, and
probably in JGit you rely on some optimized facility provided by the
language/library to perform that lookup. So there are probably some
easy improvements that can be made here.
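To illustrate the kind of easy improvement I have in mind, here is a
minimal sketch of such a lookup: an open-addressed table with linear
probing, where most non-matching slots are rejected on a cheap
4-byte compare (on a word that did not pick the slot) before paying
for the full 20-byte comparison. The names are made up and this is
not the actual object.c code; it only shows the shape of the thing.

#include <stdint.h>
#include <string.h>

/*
 * Sketch of an open-addressed object table keyed by SHA-1.  The names
 * are hypothetical; this is not the actual object.c implementation.
 */
struct object {
        unsigned char sha1[20];
        /* type, flags, etc. elided */
};

static struct object **obj_hash;        /* NULL entry == empty slot */
static unsigned int obj_hash_size;      /* power of two */

/* The slot is derived from the leading bytes of the SHA-1. */
static unsigned int hash_index(const unsigned char *sha1)
{
        uint32_t h;
        memcpy(&h, sha1, sizeof(h));
        return h & (obj_hash_size - 1);
}

/*
 * Cheap early exit: compare a word that did not choose the slot
 * before doing the full 20-byte comparison.
 */
static int same_sha1(const unsigned char *a, const unsigned char *b)
{
        if (memcmp(a + 16, b + 16, 4))
                return 0;
        return !memcmp(a, b, 20);
}

static struct object *lookup_object_sketch(const unsigned char *sha1)
{
        unsigned int i;
        struct object *obj;

        if (!obj_hash)
                return NULL;

        i = hash_index(sha1);
        while ((obj = obj_hash[i]) != NULL) {
                if (same_sha1(sha1, obj->sha1))
                        return obj;
                if (++i == obj_hash_size)       /* linear probing */
                        i = 0;
        }
        return NULL;
}

Whether that particular micro-optimization (or a smarter table layout)
buys much is something only the profiler can tell; the point is simply
that the comparison inside that probe loop is where most of the 28.81%
is likely going.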
Otherwise it is at least 12.21 + 10.49 + 7.47 + 2.71 = 32.88% spent
directly in the zlib code, making it the biggest cost. This is rather
unavoidable unless the data structure is changed. And pack v4 would
probably move things such as find_pack_entry_one, decode_tree_entry,
process_tree and tree_entry off the radar as well.

The object writeout phase should pretty much be network bound.

> I fully implemented the reuse of a cached pack behind a thin pack idea
> I was trying to describe in this thread. It saved 1m7s off the JGit
> running time, but increased the data transfer by 25 MiB. I didn't
> expect this much of an increase, I honestly expected the thin pack
> portion to be well, thinner. The issue is the thin pack cannot delta
> against all of the history, its only delta compressing against the tip
> of the cached pack. So long-lived side branches that forked off an
> older part of the history aren't delta compressing well, or at all,
> and that is significantly bloating the thin pack. (Its also why that
> "newer" pack is 57M, but should be 14M if correctly combined with the
> cached pack.) If I were to consider all of the objects in the cached
> pack as potential delta base candidates for the thin pack, the entire
> benefit of the cached pack disappears.

Yeah... this sucks.

> I'm not sure I like it so much anymore. :-)
>
> The idea was half-baked, and came at the end of a long day, and after
> putting my cranky infant son down to sleep way past his normal bed
> time. I claim I was a sleep deprived new parent who wasn't thinking
> things through enough before writing an email to git@vger.

Well, this is still valuable information to archive. And I wish I had
been able to still write such quality emails when I was a new parent. ;-)

> >> sendfile() call for the bulk of the content. I think we can just hand
> >> off the major streaming to the kernel.
> >
> > While this might look like a good idea in theory, did you actually
> > profile it to see if that would make a noticeable difference? The
> > pkt-line framing allows for asynchronous messages to be sent over a
> > sideband,
>
> No, of course not. The pkt-line framing is pretty low overhead, but
> copying kernel buffer to userspace back to kernel buffer sort of sucks
> for 400 MiB of data. sendfile() on 400 MiB to a network socket is
> much easier when its all kernel space.

Of course. But still... If you save 0.5 second by avoiding the copy to
and from user space of that 400 MiB (based on my machine which can do
1670MB/s), that's pretty much insignificant compared to the total time
for the clone, and therefore the wrong thing to optimize given the
required protocol changes.

> I figured, if it all worked
> out already to just dump the pack to the wire as-is, then we probably
> should also try to go for broke and reduce the userspace copying. It
> might not matter to your desktop, but ask John Hawley (CC'd) about
> kernel.org and the git traffic volume he is serving. They are doing
> more than 1 million git:// requests per day now.

Impressive. However, I suspect that the vast majority of those requests
are from clients making a connection just to realize they're up to date
already.

I don't think the user space copying is really a problem. Of course,
if we could have used sendfile() freely in, say, copy_pack_data() then
we would have done so long ago. But we are checksumming the data we
create on the fly together with the data we reuse from disk, so this is
not necessarily a gain.
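To make that last point concrete, the pack reuse path is essentially a
read/hash/write loop. The following is only a rough sketch of its
shape, using OpenSSL's SHA1 calls instead of git's own wrappers and
ignoring the sideband framing; it is not the actual copy_pack_data():

#include <sys/types.h>
#include <unistd.h>
#include <openssl/sha.h>

/*
 * Rough sketch only -- not the actual copy_pack_data().  Streams `len`
 * bytes of an existing pack into the outgoing pack.  The trailing
 * SHA-1 of the outgoing pack covers the reused bytes as well as the
 * freshly generated ones, so the data has to come up to user space to
 * be fed to SHA1_Update(); sendfile(out_fd, pack_fd, &offset, len)
 * would bypass that step.
 */
static int copy_pack_data_sketch(int out_fd, int pack_fd,
                                 off_t offset, size_t len, SHA_CTX *ctx)
{
        char buf[8192];

        if (lseek(pack_fd, offset, SEEK_SET) < 0)
                return -1;
        while (len) {
                size_t want = len < sizeof(buf) ? len : sizeof(buf);
                ssize_t got = read(pack_fd, buf, want);

                if (got <= 0)
                        return -1;
                /* checksum the reused bytes, then put them on the wire */
                SHA1_Update(ctx, buf, got);
                if (write(out_fd, buf, got) != got)
                        return -1;
                len -= got;
        }
        return 0;
}

And even without the checksum in the way, the reused data still goes
out through the pkt-line sideband framing rather than raw onto the
socket, which is another reason a plain sendfile() doesn't slot in
directly.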
> >> Plus we can safely do byte range requests for resumable clone within
> >> the cached pack part of the stream.
> >
> > That part I'm not sure of. We are still facing the same old issues
> > here, as some mirrors might have the same commit edges for a cache pack
> > but not necessarily the same packing result, etc. So I'd keep that out
> > of the picture for now.
>
> I don't think its that hard. If we modify the transfer protocol to
> allow the server to denote boundaries between packs, the server can
> send the pack name (as in pack-$name.pack) and the pack SHA-1 trailer
> to the client. A client asking for resume of a cached pack presents
> its original want list, these two SHA-1s, and the byte offset he wants
> to restart from. The server validates the want set is still
> reachable, that the cached pack exists, and that the cached pack tips
> are reachable from current refs. If all of that is true, it validates
> the trailing SHA-1 in the pack matches what the client gave it. If
> that matches, it should be OK to resume transfer from where the client
> asked for.

This is still only half a solution. If your network connection drops
after the first 52 MiB of transfer, given the scenario you provided,
then you're still screwed.

Nicolas
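P.S. For concreteness, the server-side check described above would
amount to something like the sketch below. Everything in it is
hypothetical: the helpers, the calling convention and the protocol
extension itself do not exist today.

#include <string.h>

/* Hypothetical type and helpers, declared only to make the sketch read. */
struct cached_pack {
        unsigned char trailer_sha1[20]; /* SHA-1 trailer of pack-$name.pack */
};
extern int ref_is_reachable(const unsigned char *sha1);
extern int cached_pack_tips_reachable(const struct cached_pack *cp);
extern struct cached_pack *find_cached_pack(const char *pack_name);

/* want_sha1s is a flat buffer of nr_want 20-byte object names. */
int can_resume(const unsigned char *want_sha1s, int nr_want,
               const char *pack_name, const unsigned char *pack_trailer)
{
        struct cached_pack *cp;
        int i;

        /* every object on the original want list must still be reachable */
        for (i = 0; i < nr_want; i++)
                if (!ref_is_reachable(want_sha1s + 20 * i))
                        return 0;

        /* the cached pack must still exist under that name ... */
        cp = find_cached_pack(pack_name);
        if (!cp)
                return 0;

        /* ... its tips must still be reachable from current refs ... */
        if (!cached_pack_tips_reachable(cp))
                return 0;

        /* ... and its trailing SHA-1 must match what the client saw */
        if (memcmp(cp->trailer_sha1, pack_trailer, 20))
                return 0;

        /* only then is it OK to resend from the client's byte offset */
        return 1;
}

The objection above still stands, though: this only covers the cached
pack portion of the stream, not the thin pack sent in front of it.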