On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
> On Fri, 28 Jan 2011, Shawn Pearce wrote:
>
>> This started because I was looking for a way to speed up clones coming
>> from a JGit server. Cloning the linux-2.6 repository is painful, ...
>> Later I realized, we can get rid of that cached list of objects and
>> just use the pack itself. ...
>
> Playing my old record again... I know. But pack v4 should solve a big
> part of this enumeration cost.

I've said the same thing for years myself. As much as it would be nice to fix some of the decompression costs of pack v2/v3, v2/v3 is very common in the wild, and a new pack encoding would be a fairly complex thing to get added to C Git. And pack v4 doesn't eliminate the enumeration; it only makes it faster.

> So that's the idea. Keep the exact same functionality as we have now,
> without any need for cache management, but making the data structure in
> a form that should improve object enumeration by some magnitude.

That's what I also liked about my --create-cache flag. It's keeping the same data we already have, in the same format we already have it in. We're just making a more explicit statement that everything in some pack is about as tightly compressed as it ever will be for a client, and isn't going to change anytime soon. Thus we might as well tag it with .keep to prevent repack from mucking with it, and we can take advantage of this to serve the pack to clients very fast.

Over breakfast this morning I made the point to Junio that with the cached pack and a slight network protocol change (enabled by a capability, of course) we could stop using pkt-line framing when sending the cached pack part of the stream, and just send the pack directly down the socket. That changes the clone of a 400 MB project like linux-2.6 from being a lot of user-space work to just being a sendfile() call for the bulk of the content. I think we can hand off the major streaming to the kernel.
(Part of the protocol change is that we would need to use multiple SHA-1 checksums in the stream, so we don't have to re-checksum the existing cached pack.)

I love some of the concepts in pack v4. I really do. But this sounds a lot simpler to implement, and it lets us completely eliminate a massive amount of server processing (even under pack v4 you still have object enumeration), in exchange for what might be a few extra MBs on the wire to the client, due to slightly less good deltas and the use of REF_DELTA in the thin pack that carries the most recent objects.

I don't envision this being used on projects smaller than git.git itself; if you can gc --aggressive the whole thing in a minute, the cached pack is probably pointless. But if you have 400+ MB, you want the transfer to be network bound, with almost no CPU impact on the server. Plus we can safely do byte-range requests for resumable clone within the cached pack part of the stream.

And when pack v4 comes along, we can use this same strategy for an equally large pack v4 pack.

-- 
Shawn.
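For anyone not steeped in the wire protocol: the framing we would be skipping for the cached pack is pkt-line, where every chunk carries a 4-digit hex length prefix that counts itself. A minimal sketch of the encoder (mine, for illustration):

```python
# Minimal sketch of Git's pkt-line framing. Each chunk is prefixed with
# a 4-hex-digit length that includes the 4 prefix bytes themselves;
# "0000" is the flush packet that carries no payload.
def pkt_line(payload: bytes) -> bytes:
    assert len(payload) + 4 <= 0xFFFF  # pkt-line length field is 16-bit hex
    return b"%04x" % (len(payload) + 4) + payload

FLUSH_PKT = b"0000"
```

With the capability negotiated, the server could keep emitting pkt-lines for the thin pack of recent objects, then drop to raw bytes for the cached pack portion so sendfile() can take over.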