Re: Reduce pack-objects memory footprint?

On Wed, Feb 28, 2018 at 6:11 PM, Jeff King <peff@xxxxxxxx> wrote:
>> > The torvalds/linux fork network has ~23 million objects,
>> > so it's probably 7-8 GB of book-keeping. Which is gross, but 64GB in a
>> > server isn't uncommon these days.
>>
>> I wonder if we could do book-keeping for some but not all objects,
>> because keeping track of all objects simply does not scale. Say we
>> have a big pack of many GBs: could we keep the bottom 80% of it
>> untouched and register only the top 20% (mostly non-blobs, plus some
>> blobs needed as delta bases) for repacking? We copy the bottom part
>> to the new pack byte-by-byte, then pack-objects rebuilds the top
>> part with objects from other sources.
>
> Yes, though I think it would take a fair bit of surgery to do
> internally. And some features (like bitmap generation) just wouldn't
> work at all.
>
> I suspect you could simulate it, though, by just packing your subset
> with pack-objects (feeding it directly without using "--revs") and then
> catting the resulting packfiles together with a fixed-up header.
>
> At one point I played with a "fast pack" that would just cat packfiles
> together. My goal was to make cases with 10,000 packs workable by
> creating one lousy pack, and then repacking that lousy pack with a
> "real" repack. In the end I abandoned it in favor of fixing the
> performance problems from trying to make a real pack of 10,000 packs. :)
>
> But I might be able to dig it up if you want to experiment in that
> direction.

Naah, it's OK. I'll go in a similar direction, but I'd repack those
pack files too, all except the big one. Let's see how it turns out.
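
For anyone who wants to try the "fixed-up header" route peff mentions:
a version-2 pack is a 12-byte header ("PACK", a 4-byte version, a
4-byte object count, both big-endian), the object entries, and a
trailing SHA-1 of everything before it. OFS_DELTA offsets are relative
and point backwards, so they survive concatenation; REF_DELTA bases
just have to be present (i.e. no thin packs). So "catting" two packs
means writing a header with the summed count, appending both bodies
minus their headers and trailers, and recomputing the checksum. A
rough sketch, not peff's fast-pack, assuming complete v2 packs and
OpenSSL's SHA-1 (build with -lcrypto):

/* pack-cat: concatenate two version-2 packfiles with a fixed-up
 * header. A sketch only, not peff's "fast pack": assumes complete
 * (non-thin) packs. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <openssl/sha.h>

static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | p[3];
}

static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v;
}

/* Check the 12-byte header, return the object count, and leave the
 * stream positioned at the first object entry. */
static uint32_t open_pack(FILE *f, long *body_len)
{
    unsigned char hdr[12];
    long size;

    if (fread(hdr, 1, 12, f) != 12 || memcmp(hdr, "PACK", 4) ||
        be32(hdr + 4) != 2) {
        fprintf(stderr, "input is not a version-2 pack\n");
        exit(1);
    }
    fseek(f, 0, SEEK_END);
    size = ftell(f);
    fseek(f, 12, SEEK_SET);
    *body_len = size - 12 - 20;  /* drop header and SHA-1 trailer */
    return be32(hdr + 8);
}

int main(int argc, char **argv)
{
    unsigned char hdr[12], buf[65536], sha[20];
    FILE *in[2], *out;
    long left[2], n;
    SHA_CTX ctx;

    if (argc != 4) {
        fprintf(stderr, "usage: pack-cat <a.pack> <b.pack> <out>\n");
        return 1;
    }
    in[0] = fopen(argv[1], "rb");
    in[1] = fopen(argv[2], "rb");
    out = fopen(argv[3], "wb");
    if (!in[0] || !in[1] || !out) {
        perror("fopen");
        return 1;
    }

    /* Fixed-up header: same magic and version, summed object count. */
    uint32_t count = open_pack(in[0], &left[0]) +
                     open_pack(in[1], &left[1]);
    memcpy(hdr, "PACK", 4);
    put_be32(hdr + 4, 2);
    put_be32(hdr + 8, count);
    SHA1_Init(&ctx);
    SHA1_Update(&ctx, hdr, 12);
    fwrite(hdr, 1, 12, out);

    /* Copy both bodies verbatim. OFS_DELTA offsets are relative and
     * backward-pointing, so they survive the concatenation. */
    for (int i = 0; i < 2; i++) {
        while (left[i] > 0) {
            n = fread(buf, 1, left[i] < (long)sizeof(buf) ?
                      (size_t)left[i] : sizeof(buf), in[i]);
            if (n <= 0) {
                perror("fread");
                return 1;
            }
            SHA1_Update(&ctx, buf, n);
            fwrite(buf, 1, n, out);
            left[i] -= n;
        }
    }

    /* New trailing checksum over everything written so far. */
    SHA1_Final(sha, &ctx);
    fwrite(sha, 1, 20, out);
    return 0;
}

The result is a valid but badly deltified pack, which is the point:
you would hand it to a real repack afterwards, as peff describes.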

>> They are 32 bytes per entry, so they should take less space than an
>> object_entry. I briefly wondered if we should fall back to external
>> rev-list too, just to free that memory.
>>
>> So that's about 200 MB for those objects (or maybe more for
>> commits). Add the 256 MB delta cache on top and it's still a long
>> way from 1.7G. There's something I'm still missing.
>
> Are you looking at RSS or heap? Keep in mind that you're mmap-ing what's
> probably a 1GB packfile on disk. If you're under memory pressure that
> won't all stay resident, but some of it will be counted in RSS.

Interesting. It was RSS.
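
For what it's worth, the quoted numbers (roughly 200 MB of entries
plus the 256 MB delta cache) only account for ~450 MB of heap, so
file-backed pages from the mmap'd pack plausibly make up much of the
remaining RSS. On Linux you can tell the two apart: /proc/self/status
splits VmRSS into RssAnon (heap and other anonymous memory) and
RssFile (mmap'd files) on kernels 4.5 and newer. A tiny sketch one
could paste into a debugging build:

/* Sketch: split RSS into anonymous (heap etc.) vs file-backed
 * (mmap'd packs) pages by reading /proc/self/status. Linux-only;
 * the RssAnon/RssFile lines need kernel 4.5 or newer. */
#include <stdio.h>
#include <string.h>

static void report_rss(void)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return;
    while (fgets(line, sizeof(line), f)) {
        /* VmRSS = RssAnon + RssFile + RssShmem */
        if (!strncmp(line, "VmRSS:", 6) ||
            !strncmp(line, "RssAnon:", 8) ||  /* heap, stacks */
            !strncmp(line, "RssFile:", 8))    /* mmap'd files */
            fputs(line, stderr);
    }
    fclose(f);
}

int main(void)
{
    report_rss();
    return 0;
}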

>> Pity we can't do the same for 'struct object'. Most of the time we
>> have a giant .idx file containing most of the hashes. We could look
>> up in both places to find an object: the hash table in object.c,
>> and the .idx file. Then objects backed by the .idx file would not
>> need the "oid" field (needed as the key for the hash table). But I
>> see no way to make that change.
>
> Yeah, that would be pretty invasive, I think. I also wonder if it would
> perform worse due to cache effects.

It should be better because of cache effects, I think. I mean, a hash
map is about the least cache-friendly lookup structure there is.
Moving most objects out of the hash table shrinks it, which is even
nicer to the cache. But we also lose O(1) lookups when we fall back
to a binary search on the .idx file (after failing to find the object
in the hash table).
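
To make the fallback concrete: the pack .idx keeps a 256-entry fanout
table of cumulative object counts indexed by the first hash byte,
followed by the sorted 20-byte object names, so the second-tier lookup
is a fanout-narrowed binary search, O(log n) per probe instead of
O(1). A rough sketch of the two-tier scheme (the hash-table side is a
stand-in stub, not git's actual object.c code):

/* Sketch of the two-tier lookup. The hash-table side is a stub
 * standing in for git's object.c table; the idx side follows the
 * .idx layout: a 256-entry fanout of cumulative counts, then the
 * sorted 20-byte object names. */
#include <string.h>
#include <stdint.h>

struct object;  /* opaque here */

/* Stand-in for the (now much smaller) in-core hash table; it holds
 * nothing in this sketch, so every lookup falls through to the idx. */
static struct object *lookup_object_in_table(const unsigned char *sha1)
{
    (void)sha1;
    return NULL;
}

struct idx {
    uint32_t fanout[256];        /* fanout[i] = #names with first byte <= i */
    const unsigned char *names;  /* sorted 20-byte object names */
};

/* Fanout-narrowed binary search; returns the object's position in
 * the index, or -1 if absent. */
static long find_in_idx(const struct idx *idx, const unsigned char *sha1)
{
    uint32_t lo = sha1[0] ? idx->fanout[sha1[0] - 1] : 0;
    uint32_t hi = idx->fanout[sha1[0]];

    while (lo < hi) {
        uint32_t mi = lo + (hi - lo) / 2;
        int cmp = memcmp(idx->names + 20 * (size_t)mi, sha1, 20);
        if (!cmp)
            return mi;
        if (cmp < 0)
            lo = mi + 1;
        else
            hi = mi;
    }
    return -1;
}

static long lookup_two_tier(const struct idx *idx, const unsigned char *sha1)
{
    if (lookup_object_in_table(sha1))
        return -2;                     /* found in core; no idx position */
    return find_in_idx(idx, sha1);     /* O(log n) instead of O(1) */
}

int main(void)
{
    /* Toy index: two names, 00...00 and ab00...00. */
    static unsigned char names[40];
    struct idx idx = { {0}, names };

    names[20] = 0xab;
    for (int i = 0; i < 256; i++)
        idx.fanout[i] = i >= 0xab ? 2 : 1;
    return lookup_two_tier(&idx, names + 20) == 1 ? 0 : 1;
}

Whether the smaller, hotter hash table wins back more than the extra
log-n probes cost is exactly the cache question above.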
-- 
Duy


