On Wed, Feb 28, 2018 at 05:58:50PM +0700, Duy Nguyen wrote:

> > Yeah, the per object memory footprint is not great. Around 100
> > million objects it becomes pretty ridiculous. I started to dig into
> > it a year or three ago when I saw such a case, but it turned out to
> > be something that we could prune.
>
> We could? What could we prune?

Sorry, I just meant that my 100-million-object case turned out not to
need all those objects, and I was able to prune it down. No code fixes
came out of it. ;)

> > The torvalds/linux fork network has ~23 million objects, so it's
> > probably 7-8 GB of book-keeping. Which is gross, but 64GB in a
> > server isn't uncommon these days.
>
> I wonder if we could just do book-keeping for some but not all
> objects, because tracking all objects simply does not scale. Say we
> have a big pack of many GBs. Could we keep the bottom 80% of it
> untouched and register the top 20% (mostly non-blobs, plus some more
> blobs needed as delta bases) for repack? We copy the bottom part to
> the new pack byte-by-byte, then pack-objects rebuilds the top part
> with objects from other sources.

Yes, though I think it would take a fair bit of surgery to do
internally. And some features (like bitmap generation) just wouldn't
work at all.

I suspect you could simulate it, though, by just packing your subset
with pack-objects (feeding it directly without using "--revs") and
then catting the resulting packfiles together with a fixed-up header
(see the sketch at the end of this mail).

At one point I played with a "fast pack" that would just cat packfiles
together. My goal was to make cases with 10,000 packs workable by
creating one lousy pack, and then repacking that lousy pack with a
"real" repack. In the end I abandoned it in favor of fixing the
performance problems that came from trying to make a real pack out of
10,000 packs. :) But I might be able to dig it up if you want to
experiment in that direction.

> They are 32 bytes per entry, so they should take less space than an
> object_entry. I briefly wondered if we should fall back to external
> rev-list too, just to free that memory.
>
> So about 200 MB for those objects (or maybe more for commits). Add
> the 256 MB delta cache on top and it's still well short of 1.7G.
> There's something I'm still missing.

Are you looking at RSS or heap? Keep in mind that you're mmap-ing
what's probably a 1GB packfile on disk. If you're under memory
pressure that won't all stay resident, but some of it will be counted
in RSS.

> Pity we can't do the same for 'struct object'. Most of the time we
> have a giant .idx file with most of the hashes. We could look an
> object up in both places: the hash table in object.c, and the .idx
> file. Then objects that are associated with an .idx file would not
> need the "oid" field (which is needed as the key for the hash table).
> But I see no way to make that change.

Yeah, that would be pretty invasive, I think. I also wonder if it
would perform worse due to cache effects.

-Peff
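
P.S. For anyone who wants to experiment with the cat-the-packfiles
trick above, the mechanics are small: a packfile is a 12-byte header
("PACK", a version, a big-endian object count), then the object
entries, then a trailing SHA-1 over everything before it. So
concatenation means keeping one header with the summed object count,
stripping the other headers and all the trailers, and recomputing the
final checksum. Here's an untested Python sketch of the idea (my own
toy, not anything in git.git); it assumes version-2 packs, no thin
packs, and no duplicate objects between the inputs, and you'd still
run "git index-pack" on the result to get a usable .idx:

    #!/usr/bin/env python3
    # cat-packs.py: concatenate packfiles into one pack ("fast pack").
    import hashlib, struct, sys

    def read_entries(path):
        """Return (object_count, entry_bytes) for one v2 packfile."""
        with open(path, 'rb') as f:
            data = f.read()
        magic, version, count = struct.unpack('>4sII', data[:12])
        assert magic == b'PACK' and version == 2, path
        # Drop the 12-byte header and 20-byte SHA-1 trailer. OFS_DELTA
        # offsets are relative, so they stay valid as long as each
        # pack's entries are kept together and in their original order.
        return count, data[12:-20]

    total, bodies = 0, []
    for path in sys.argv[2:]:
        count, body = read_entries(path)
        total += count
        bodies.append(body)

    sha = hashlib.sha1()
    with open(sys.argv[1], 'wb') as out:
        header = struct.pack('>4sII', b'PACK', 2, total)
        sha.update(header)
        out.write(header)
        for body in bodies:
            sha.update(body)
            out.write(body)
        # The trailer is a SHA-1 over everything that precedes it.
        out.write(sha.digest())

Usage would be something like:

    python3 cat-packs.py big.pack .git/objects/pack/pack-*.pack
    git index-pack big.pack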
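
P.P.S. To make the RSS-vs-heap point concrete, here's a small
Linux-only illustration (again just something I cooked up, not git
code): it maps a big file read-only, faults its pages in, and shows
VmRSS growing even though nothing was malloc-ed.

    #!/usr/bin/env python3
    # Show that resident pages of an mmap count toward RSS.
    import mmap, sys

    def vmrss_kb():
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])

    with open(sys.argv[1], 'rb') as f:
        m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        print('VmRSS before touching pages: %d kB' % vmrss_kb())
        total = 0
        for off in range(0, len(m), 4096):  # fault in each page
            total += m[off]
        print('VmRSS after touching pages:  %d kB' % vmrss_kb())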