On 5/21/07, Shawn O. Pearce <spearce@xxxxxxxxxxx> wrote:
Dana How <danahow@xxxxxxxxx> wrote:
> ... Operations such as "git-log --pretty=oneline" were about 30X
> faster on a cold cache and 2 to 3X faster otherwise. Process sizes
> remained reasonable.
Can you give me details about your system? Is this a 64 bit binary?
RHEL4/Nahant on an Opteron. Yes.
What is your core.packedGitWindowSize and core.packedGitLimit set to?
I didn't change the default.
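(A quick way to double-check that nothing in ~/.gitconfig or .git/config overrides them, using only stock git-config; note that --list prints key names lowercased, hence the case-insensitive grep. Just a sketch:)

  % git config --list | grep -i packedgit

No output means both core.packedGitWindowSize and core.packedGitLimit are still at their built-in defaults.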
It sounds like the packed version was almost 3 GiB smaller, but was slower because we were mmap'ing far too much data at startup and that was making your OS page in things that you didn't really need to have.
The difference in size is because of the "Custom compression levels" patch -- now the loose objects use Z_BEST_SPEED, whereas the packs use Z_DEFAULT_COMPRESSION.
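(For reference, a sketch of how that split can be expressed as configuration. core.loosecompression and pack.compression are the variable names the "Custom compression levels" patch proposes, so treat them as assumptions until it is merged; zlib level 1 is Z_BEST_SPEED and -1 is Z_DEFAULT_COMPRESSION:)

  % git config core.loosecompression 1
  % git config pack.compression -1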
Mind trying git-log with a smaller core.packedGitWindow{Size,Limit}? Perhaps it's just as simple as our defaults being far, far too high for your workload...
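(Concretely, that experiment is just two settings plus a re-timed git-log; the 32m/256m values below are purely illustrative, not a recommendation:)

  % git config core.packedGitWindowSize 32m
  % git config core.packedGitLimit 256m
  % time git-log --pretty=oneline >/dev/null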
I think that's a good idea, and it should be easy to try tomorrow. It will definitely improve the cold-cache case. But we need to consider both *read* and *creation* performance. The portion of the repo I imported into git grows at about 500MB/week (compressed). Should I repack -a every week? Every month? In any case, should I use the default window/depth, or 0/0? With the defaults, run-times are prohibitive (in fact, I've always killed each attempt so the machine could be used for "real" work); with 0/0, I lose deltification on all objects. These megablobs really are outliers, and they stress the "one size fits all" approach of packing in git.

As a thought experiment, let's (1) pretend git-repack takes --max-blob-size= and --max-pack-size=, (2) pretend the patch doesn't add the repack.maxblobsize variable, and (3) do the following:

  % git-repack -a -d --max-blob-size=256
  % git-repack --max-pack-size=2047 --window=0 --depth=0

The first step makes a digestible 13MB packfile, and the second puts all the megablobs into six or more 2GB packfiles. Is there really any advantage to carrying out the second step? If I'm processing a 100MB+ blob, do I really care about an extra open(2) call?

Thanks,
-- 
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell