On Sun, Jan 30, 2011 at 00:05, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Shawn Pearce <spearce@xxxxxxxxxxx> writes:
>
>> Using this for object enumeration shaves almost 1 minute off server
>> packing time; the clone dropped from 3m28s to 2m29s. That is close to
>> what I was getting with the cached pack idea, but the network transfer
>> stayed the small 376 MiB.
>
> I like this result.

I'm really leaning towards putting this cached object list into JGit.
I need to shave that 1 minute off server CPU time. I can afford the
41 MiB of disk (and kernel buffer cache), but I cannot keep paying
1 minute of CPU on every clone request for large repositories.

The object list of what is reachable from commit X isn't ever going to
change, and the path hash function is reasonably stable. With a version
code in the file we can desupport old files if the path hash function
ever changes. (There is a rough sketch of that check at the end of this
mail.) For some of my servers, 10% more disk/kernel memory, plus some
explicit cache management by the server administrator to construct the
file, is cheap compared to 1 minute of CPU.

> The amount of transfer being that small was something I didn't quite
> expect, though. Doesn't it indicate that our pathname based object
> clustering heuristics is not as effective as we hoped?

I'm not sure I follow your question.

I think the problem here is old side branches that got recently merged.
Their _best_ delta base was some old revision, possibly close to where
they branched off. Using a newer version of the file as the delta base
created a much larger delta.

E.g. consider a file where in more recent revisions a function was
completely rewritten. If you have to delta compress against that new
version, but your copy still has the older definition of the function,
you need insert instructions for the entire content of that old
function. But if you can delta compress against the version you
branched from (or one much closer to it in time), your delta is very
small, because that whole function is handled by a single, much
smaller copy instruction. (A toy cost model below the sig makes this
concrete.)

Our clustering heuristics work fine. Our thin-pack selection of
potential delta base candidates does not. We are not very aggressive
about loading the delta base window with potential candidates, which
means we miss some really good compression opportunities.

Ooooh. I think my test was flawed.

I injected the cached pack's tip as the edge for the new stuff to
delta compress against. I should have injected all of the merge bases
between the cached pack's tip and the new stuff. Although the cached
pack's tip is one of the merge bases, it isn't all of them. If we
inject all of the merge bases, we can find the revision that an old
side branch is based on, and possibly get a better delta candidate for
it. IIRC, upload-pack would have walked backwards further and found
the merge base for that side branch, and it would have been part of
the delta base candidates. (A sketch of the merge base computation is
also below.)

I think I need to re-do my cached pack test. Good thing I have the
history of my source code saved in this fancy revision control thingy
called "git". :-)

-- 
Shawn.
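
Rough sketch of the version check I mean above. The header layout
(magic, format version, path hash version) is invented for
illustration; no such file exists in JGit yet:

  import java.io.DataInputStream;
  import java.io.FileInputStream;
  import java.io.IOException;

  class CachedObjectListCheck {
      // Invented header constants; bump PATH_HASH_VERSION whenever the
      // path hash function changes so old files are desupported.
      static final int CACHE_MAGIC = 0x4f424a4c; // "OBJL", made up
      static final int CACHE_VERSION = 1;
      static final int PATH_HASH_VERSION = 1;

      static boolean isUsable(String path) throws IOException {
          DataInputStream in =
              new DataInputStream(new FileInputStream(path));
          try {
              return in.readInt() == CACHE_MAGIC
                  && in.readInt() == CACHE_VERSION
                  && in.readInt() == PATH_HASH_VERSION;
          } finally {
              in.close();
          }
      }
  }

If any field doesn't match, the server just ignores the file, falls
back to a normal object enumeration, and the administrator regenerates
the cache.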
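
Toy cost model for the copy vs. insert point. This is not JGit's
encoder (the real one builds a rolling-hash block index over the
base); it only shows that a good base collapses an unchanged function
into one cheap copy instruction, while a bad base pays byte-by-byte
inserts:

  // Greedy toy: copy the longest run found anywhere in the base,
  // otherwise insert a literal byte. Costs roughly mirror the pack
  // delta encoding: a copy instruction is at most ~7 bytes, inserted
  // data costs about a byte per byte.
  static int deltaCost(byte[] base, byte[] target) {
      int cost = 0;
      int pos = 0;
      while (pos < target.length) {
          int best = 0;
          for (int b = 0; b < base.length; b++) {
              int len = 0;
              while (b + len < base.length && pos + len < target.length
                      && base[b + len] == target[pos + len])
                  len++;
              if (len > best)
                  best = len;
          }
          if (best >= 8) {
              cost += 7;   // one copy instruction covers the whole run
              pos += best;
          } else {
              cost += 1;   // no useful match in base: insert literal
              pos++;
          }
      }
      return cost;
  }

With a rewritten function in between, deltaCost(branchPointBlob,
sideBranchBlob) comes out far smaller than deltaCost(currentTipBlob,
sideBranchBlob).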
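
And the merge base computation for re-doing the test, sketched with
JGit's RevWalk. RevFilter.MERGE_BASE is real; cachedTip and newTips
stand in for however the server tracks the cached pack's tip and the
client's wants:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.eclipse.jgit.lib.ObjectId;
  import org.eclipse.jgit.lib.Repository;
  import org.eclipse.jgit.revwalk.RevCommit;
  import org.eclipse.jgit.revwalk.RevWalk;
  import org.eclipse.jgit.revwalk.filter.RevFilter;

  // All merge bases of the cached pack's tip and the new commits, to
  // be injected as edges (preferred delta bases) for thin packing.
  static List<RevCommit> mergeBases(Repository repo, ObjectId cachedTip,
          List<ObjectId> newTips) throws IOException {
      RevWalk rw = new RevWalk(repo);
      try {
          rw.setRevFilter(RevFilter.MERGE_BASE);
          rw.markStart(rw.parseCommit(cachedTip));
          for (ObjectId id : newTips)
              rw.markStart(rw.parseCommit(id));
          List<RevCommit> bases = new ArrayList<RevCommit>();
          RevCommit c;
          while ((c = rw.next()) != null)
              bases.add(c);
          return bases;
      } finally {
          rw.dispose();
      }
  }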