Re: [PATCH] Prevent megablobs from gunking up git packs

Dana How <danahow@xxxxxxxxx> writes:

> Using fast-import and repack with the max-pack-size patch,
> 3628 commits were imported from Perforce comprising
> 100.35GB (uncompressed) in 38829 blobs, and saved in
> 7 packfiles of 12.5GB total (--window=0 and --depth=0 were
> used due to runtime limits).  When using these packfiles,
> several git commands showed very large process sizes,
> and some slowdowns (compared to comparable operations
> on the linux kernel repo) were also apparent.
>
> git stores data in loose blobs or in packfiles.  The former
> has essentially now become an exception mechanism,  to store
> exceptionally *young* blobs.  Why not use this to store
> exceptionally *large* blobs as well?  This allows us to
> re-use all the "exception" machinery with only a small change.
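
To illustrate the "exception" machinery being re-used here (this sketch is mine, not part of the patch): a loose object's on-disk location is derived from the SHA-1 of a small header plus the raw contents, so exceptionally large blobs stored loose would fan out under objects/xx/ exactly like young blobs do. A minimal Python sketch of git's actual naming scheme:

```python
import hashlib

def loose_object_path(content: bytes) -> str:
    """Compute where git would store a blob as a loose object.

    git hashes the header "blob <size>\\0" followed by the raw
    contents, then fans the object out under
    objects/<first two hex digits>/<remaining 38 digits>.
    (Compression of the file contents is omitted here.)
    """
    header = b"blob %d\x00" % len(content)
    sha = hashlib.sha1(header + content).hexdigest()
    return "objects/%s/%s" % (sha[:2], sha[2:])

# The empty blob has the well-known SHA-1 e69de29b...
print(loose_object_path(b""))
# → objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```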

Well, my impression was that mmapping a single loose object (and
then munmapping it when done) would be more expensive than
mmapping a whole pack and accessing the object through a window,
as long as you touch the same set of objects and the object in
the pack is not deltified.

> Repacking the entire repository with a max-blob-size of 256KB
> resulted in a single 13.1MB packfile, as well as 2853 loose
> objects totaling 15.4GB compressed and 100.08GB uncompressed,
> 11 files per objects/xx directory on average.  All was created
> in half the runtime of the previous yet with standard
> --window=10 and --depth=50 parameters.  The data in the
> packfile was 270MB uncompressed in 35976 blobs.  Operations
> such as "git-log --pretty=oneline" were about 30X faster
> on a cold cache and 2 to 3X faster otherwise.  Process sizes
> remained reasonable.

I think a more reasonable comparison, to figure out what is
really going on, would be to create such a pack with the same
0/0 window and depth (i.e. keeping the huge objects out of the
pack would be the only difference from the "horrible" case).
With huge packs, I wouldn't be surprised if seeking to a
far-away part of the packfile to extract a base object takes a
lot longer than reading a delta and applying it to a base object
kept in the in-core delta base cache.
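
For readers unfamiliar with the delta base cache: git keeps recently reconstructed base objects in a small in-core LRU cache, so that several deltas against the same base do not each re-inflate it from the pack. A toy sketch of that policy (names and structure are illustrative, not git's internals):

```python
from collections import OrderedDict

class DeltaBaseCache:
    """Toy LRU cache keyed by pack offset and bounded by total
    bytes, loosely modeled on git's in-core delta base cache."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0
        self.entries = OrderedDict()  # offset -> inflated bytes

    def get(self, offset):
        data = self.entries.get(offset)
        if data is not None:
            self.entries.move_to_end(offset)  # mark recently used
        return data

    def put(self, offset, data):
        if offset in self.entries:
            self.used -= len(self.entries.pop(offset))
        self.entries[offset] = data
        self.used += len(data)
        while self.used > self.limit:
            _, evicted = self.entries.popitem(last=False)  # LRU out
            self.used -= len(evicted)
```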

Also, if by "process size" you mean total VM size rather than
RSS, I think that is the wrong measure.  As long as you do not
touch the rest of the pack, even mmapping a huge packfile would
not actually bring much data into main memory, would it?  Well,
assuming that your mmap() implementation and virtual memory
subsystem do a decent job... maybe we are spoiled by Linux
here...
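
On Linux the distinction is directly visible in /proc/self/status: mmapping a large file grows VmSize immediately, but VmRSS grows only as pages are actually faulted in. A quick sketch (the vm_kb helper is mine, and the /proc parsing is Linux-specific; a sparse temp file stands in for a large packfile):

```python
import mmap
import os
import tempfile

def vm_kb(field):
    # Parse "VmSize:   12345 kB"-style lines from /proc (Linux only).
    if not os.path.exists("/proc/self/status"):
        return None
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    return None

SIZE = 64 * 1024 * 1024  # 64 MiB, standing in for a big packfile

with tempfile.NamedTemporaryFile() as tmp:
    tmp.truncate(SIZE)  # sparse file: no data blocks written yet
    before = vm_kb("VmSize")
    m = mmap.mmap(tmp.fileno(), SIZE, access=mmap.ACCESS_READ)
    after = vm_kb("VmSize")
    # Address space grew by ~64 MiB, but RSS barely moves because
    # no page of the mapping has been touched yet.
    if before is not None:
        print("VmSize grew by about", after - before, "kB")
    m.close()
```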

