Re: [PATCH 0/5] handling 4GB .idx files

On Mon, Nov 16, 2020 at 08:30:34AM -0500, Derrick Stolee wrote:

> > which took almost 13 minutes of CPU to run, and peaked around 15GB of
> > RAM (and takes about 6.7GB on disk).
> 
> I was thinking that maybe the RAM requirements would be lower
> if we batched the fast-import calls and then repacked, but then
> the repack would probably be just as expensive.

I think it's even worse. Fast-import holds just enough data to create
the index (sha1, etc.), but pack-objects is also holding data to support
the delta search, etc. A quick check (well, quick to invoke, not to
run):

   # list every object in the pack, then feed them all to pack-objects
   git show-index <.git/objects/pack/pack-*.idx |
   awk '{print $2}' |
   git pack-objects foo --all-progress

on the fast-import pack seems to cap out around 27GB.
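
(For anyone reproducing that number: GNU time's verbose report is one
way to capture the peak. A rough sketch, assuming Linux with GNU time
installed as /usr/bin/time; the "Maximum resident set size" line of its
report is the figure of interest:)

   git show-index <.git/objects/pack/pack-*.idx |
   awk '{print $2}' >oids
   /usr/bin/time -v git pack-objects foo --all-progress <oids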

I doubt you could do much better overall than fast-import in terms of
CPU. The trick is really that you need to have a matching content/sha1
pair for 154M objects, and that's where most of the time goes. If we
lied about what's in each object (just generating an index with sha1
...0001, ...0002, etc.), we could go much faster. But then it's a much
less interesting test.
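
(For scale, generating the fake names themselves costs next to nothing;
a throwaway loop like this emits 154M well-formed, already-sorted
40-hex names, so essentially all of the honest run's time is in hashing
real content and writing the pack:)

   # decimal digits are valid hex, so these are legal, sorted sha1s
   awk 'BEGIN { for (i = 1; i <= 154000000; i++) printf("%040d\n", i) }'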

> > That's the most basic test I think you could do. More interesting is
> > looking at entries that are actually after the 4GB mark. That requires
> > dumping the whole index:
> > 
> >   final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
> >   git cat-file blob $final
> 
> Could you also (after running the test once) determine the largest
> SHA-1, at least up to unique short-SHA? Then run something like
> 
> 	git cat-file blob fffffe
> 
> Since your loop is hard-coded, you could even use the largest full
> SHA-1.

That $final is the highest sha1. We could hard-code it, yes (and the
resulting lookup via cat-file is quite fast; it's the linear index dump
that's slow). We'd need the matching sha256 version, too. But it's
really the generation of the data that's the main issue.
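
(The sha256 counterpart would come from the same tail-of-index dump run
against the sha256-generated repo; since the idx file alone doesn't say
which hash it holds, passing it explicitly is safest. An untested
sketch:)

   final256=$(git show-index --object-format=sha256 \
                 <.git/objects/pack/*.idx |
              tail -n 1 | awk '{print $2}')
   git cat-file blob $final256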

> Naturally, nothing short of a full .idx verification would be
> completely sound, and we are already generating an enormous repo.

Yep.

> > So I dunno. I wouldn't be opposed to codifying some of that in a script,
> > but I can't imagine anybody ever running it unless they were working on
> > this specific problem.
> 
> It would be good to have this available somewhere in the codebase to
> run whenever testing .idx changes. Perhaps create a new prerequisite
> specifically for EXPENSIVE_IDX tests, triggered only by a GIT_TEST_*
> environment variable?

My feeling is that anybody who's really interested in playing with this
topic can find this thread in the archive. I don't think they'd be any
worse off there than with a bit-rotting script in the repo that nobody
ever runs.

But if somebody wants to write up a test script, I'm happy to review it.
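
(To sketch the shape I'd expect: the usual pattern is a lazy
prerequisite keyed off an environment variable, something like the
below, where the GIT_TEST_EXPENSIVE_IDX name is purely hypothetical:)

   test_lazy_prereq EXPENSIVE_IDX 'test -n "$GIT_TEST_EXPENSIVE_IDX"'

   test_expect_success EXPENSIVE_IDX 'lookup past the 4GB offset boundary' '
           final=$(git show-index <.git/objects/pack/*.idx |
                   tail -n 1 | cut -d" " -f2) &&
           git cat-file blob $final
   '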

> It would be helpful to also write a multi-pack-index on top of this
> .idx to ensure we can handle that case, too.

I did run "git multi-pack-index write" on the resulting repo, which
completed in a reasonable amount of time (maybe 30-60s), and then
confirmed that lookups in the midx work just fine.
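
(Roughly along these lines; "git multi-pack-index verify" is the
natural extra step if someone wants a full consistency pass rather than
spot-checked lookups:)

   git multi-pack-index write
   git multi-pack-index verify
   git cat-file blob $final    # spot-check a lookup through the midx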

-Peff


