Re: [RFC] Add --create-cache to repack

On Fri, Jan 28, 2011 at 13:09, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
> On Fri, 28 Jan 2011, Shawn Pearce wrote:
>
>> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
>> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
>> >
>> >> This started because I was looking for a way to speed up clones coming
>> >> from a JGit server.  Cloning the linux-2.6 repository is painful,

Well, scratch the idea in this thread.  I think.

I retested JGit vs. CGit on an identical linux-2.6 repository.  The
repository was fully packed, but had two pack files, 362M and 57M: it
was created by packing a one-month-old master, marking that pack
.keep, and then repacking -a -d to put the most recent month into a
second pack.  This results in some files that should be delta
compressed together being stored whole across the two packs
(obviously).

The two implementations take about the same amount of time to generate
the clone: 3m28s / 3m22s for JGit, 3m23s for C Git.  The JGit-created
pack is actually smaller, 376.30 MiB vs. C Git's 380.59 MiB.  I point
out this data because, given how close the two are in running time,
improvements made for JGit would likely carry over to CGit as well.

I fully implemented the reuse of a cached pack behind a thin pack, the
idea I was trying to describe in this thread.  It saved 1m7s off the
JGit running time, but increased the data transfer by 25 MiB.  I
didn't expect this much of an increase; I honestly expected the thin
pack portion to be, well, thinner.  The issue is that the thin pack
cannot delta against all of the history, it is only delta compressing
against the tip of the cached pack.  So long-lived side branches that
forked off an older part of the history aren't delta compressing well,
or at all, and that is significantly bloating the thin pack.  (It's
also why that "newer" pack is 57M, but should be 14M if correctly
combined with the cached pack.)  If I were to consider all of the
objects in the cached pack as potential delta base candidates for the
thin pack, the entire benefit of the cached pack disappears.


Which leaves me dropping this idea.  I started it because I was
actually looking for a way to speed up JGit, but we're already roughly
on par with CGit performance.  Dropping 1m7s from a clone is great,
but not at the expense of a 6.5% larger network transfer.  For most
clients, 25 MiB of additional data transfer will likely cost far more
time than the 1m7s saved in server-side computation.

>> That's what I also liked about my --create-cache flag.
>
> I do agree on that point.   And I like it too.

I'm not sure I like it so much anymore.  :-)

The idea was half-baked; it came at the end of a long day, after
putting my cranky infant son down to sleep way past his normal
bedtime.  I claim I was a sleep-deprived new parent who wasn't
thinking things through before writing an email to git@vger.

>> sendfile() call for the bulk of the content.  I think we can just hand
>> off the major streaming to the kernel.
>
> While this might look like a good idea in theory, did you actually
> profile it to see if that would make a noticeable difference?  The
> pkt-line framing allows for asynchronous messages to be sent over a
> sideband,

No, of course not.  The pkt-line framing is pretty low overhead, but
copying from a kernel buffer to userspace and back to a kernel buffer
sort of sucks for 400 MiB of data.  sendfile() on 400 MiB to a network
socket is much easier when it all stays in kernel space.  I figured
that if it already worked out to just dump the pack to the wire as-is,
we should probably go for broke and eliminate the userspace copying
too.  It might not matter on your desktop, but ask John Hawley (CC'd)
about kernel.org and the git traffic volume he is serving.  They are
doing more than 1 million git:// requests per day now.
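
To make that concrete, here is a rough sketch of what streaming an
existing pack file straight to the client socket could look like with
sendfile(2); this is not actual upload-pack code, and the function
name and error handling are mine:

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Stream an on-disk pack file to an already-connected client socket.
 * Returns 0 on success, -1 on any error; the caller reports it. */
static int stream_pack(int sock_fd, const char *pack_path)
{
	int in_fd = open(pack_path, O_RDONLY);
	struct stat st;
	off_t offset = 0;

	if (in_fd < 0)
		return -1;
	if (fstat(in_fd, &st) < 0) {
		close(in_fd);
		return -1;
	}

	while (offset < st.st_size) {
		ssize_t n = sendfile(sock_fd, in_fd, &offset,
				     st.st_size - offset);
		if (n <= 0) {
			close(in_fd);
			return -1;
		}
	}
	close(in_fd);
	return 0;
}

This only covers the bulk pack data; anything that still needs
pkt-line framing (progress messages on the sideband, for example)
would have to be handled separately, which is part of the concern you
raise above.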

>> Plus we can safely do byte range requests for resumable clone within
>> the cached pack part of the stream.
>
> That part I'm not sure of.  We are still facing the same old issues
> here, as some mirrors might have the same commit edges for a cache pack
> but not necessarily the same packing result, etc.  So I'd keep that out
> of the picture for now.

I don't think it's that hard.  If we modify the transfer protocol to
allow the server to denote boundaries between packs, the server can
send the pack name (as in pack-$name.pack) and the pack SHA-1 trailer
to the client.  A client asking to resume a cached pack presents its
original want list, these two SHA-1s, and the byte offset it wants to
restart from.  The server validates that the want set is still
reachable, that the cached pack exists, and that the cached pack tips
are reachable from the current refs.  If all of that is true, it
verifies that the trailing SHA-1 in the pack matches what the client
gave it.  If that matches, it should be OK to resume the transfer from
where the client asked.
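
In rough C, the server-side checks would be something like the sketch
below.  Every type and helper in it is a hypothetical placeholder
(none of this exists in git today); only the order of the checks
follows the description above.

#include <stddef.h>
#include <string.h>

struct cached_pack {
	const char *trailer_sha1;   /* hex trailer of pack-$name.pack */
};

/* Placeholder stubs so the sketch is self-contained; a real server
 * would implement these against its object database and pack store. */
static int want_set_reachable(const char **wants, size_t nr)
{ (void)wants; (void)nr; return 1; }
static struct cached_pack *find_cached_pack(const char *pack_name)
{ (void)pack_name; return NULL; }
static int pack_tips_reachable(const struct cached_pack *cp)
{ (void)cp; return 1; }

static int can_resume(const char **wants, size_t nr_wants,
		      const char *pack_name, const char *client_trailer)
{
	struct cached_pack *cp;

	/* 1. everything the client originally asked for must still exist */
	if (!want_set_reachable(wants, nr_wants))
		return 0;

	/* 2. the named cached pack must still be present on this node */
	cp = find_cached_pack(pack_name);
	if (!cp)
		return 0;

	/* 3. the cached pack's tips must be reachable from current refs */
	if (!pack_tips_reachable(cp))
		return 0;

	/* 4. the trailing SHA-1 must match what the client already saw */
	if (strcmp(cp->trailer_sha1, client_trailer))
		return 0;

	/* OK to seek to the client's byte offset and resume the transfer */
	return 1;
}

If any check fails, the server would presumably just fall back to a
normal, full response.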

Then it's up to the server administrators of a round-robin serving
cluster to ensure that the same cached pack is available on all nodes,
so that a resuming client is likely to have its request succeed.  This
isn't impossible.  If the server operator cares, they can keep the
prior cached pack around for several weeks after creating a newer one,
giving clients plenty of time to resume a broken clone.  Disk is
fairly inexpensive these days.

But it's perhaps pointless; see above.  :-)

-- 
Shawn.

