On 11/15/2020 11:10 PM, Jeff King wrote:
> On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:
>
>> On 13.11.2020 06:06, Jeff King wrote:
>>> I recently ran into a case where Git could not read the pack it had
>>> produced via running "git repack". The culprit turned out to be an .idx
>>> file which crossed the 4GB barrier (in bytes, not number of objects).
>>> This series fixes the problems I saw, along with similar ones I
>>> couldn't trigger in practice, and protects the .idx loading code
>>> against integer overflows that would fool the size checks.
>>
>> Would it be feasible to have a test case for this large index case?
>> This should very certainly have an EXPENSIVE tag, or might even not
>> yet work on windows. But hopefully someday I'll find some more time
>> to push large object support on windows forward, and these kind of
>> tests would really help then.
>
> I think it would be a level beyond what we usually consider even for
> EXPENSIVE. The cheapest I could come up with to generate the case is:

I agree that the cost of this test is more than I would expect for
EXPENSIVE.

>   perl -e '
>     for (0..154_000_000) {
>       print "blob\n";
>       print "data <<EOF\n";
>       print "$_\n";
>       print "EOF\n";
>     }
>   ' |
>   git fast-import
>
> which took almost 13 minutes of CPU to run, and peaked around 15GB of
> RAM (and takes about 6.7GB on disk).

I was thinking that maybe the RAM requirements would be lower if we
batched the fast-import calls and then repacked, but then the repack
would probably be just as expensive.

> In the resulting repo, the old code barfed on lookups:
>
>   $ blob=$(echo 0 | git hash-object --stdin)
>   $ git cat-file blob $blob
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file
>
> whereas now it works:
>
>   $ git cat-file blob $blob
>   0
>
> That's the most basic test I think you could do. More interesting is
> looking at entries that are actually after the 4GB mark. That requires
> dumping the whole index:
>
>   final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
>   git cat-file blob $final

Could you also (after running the test once) determine the largest
SHA-1, at least up to a unique short SHA-1? Then run something like

  git cat-file blob fffffe

Since your loop is hard-coded, you could even use the largest full
SHA-1. Naturally, nothing short of a full .idx verification would be
completely sound, and we are already generating an enormous repo.

> So I dunno. I wouldn't be opposed to codifying some of that in a
> script, but I can't imagine anybody ever running it unless they were
> working on this specific problem.

It would be good to have this available somewhere in the codebase to
run whenever testing .idx changes. Perhaps create a new prerequisite
specifically for EXPENSIVE_IDX tests, triggered only by a GIT_TEST_*
environment variable?
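To make that concrete, here is a rough sketch of what I am imagining
(completely untested, and the EXPENSIVE_IDX prereq and the
GIT_TEST_EXPENSIVE_IDX variable are names I am making up here):

  # Assumes the repo from your fast-import recipe above has already
  # been created; generating it inside the test would work too, at
  # the cost of those ~13 minutes and ~15GB of RAM.
  test_lazy_prereq EXPENSIVE_IDX 'test -n "$GIT_TEST_EXPENSIVE_IDX"'

  test_expect_success EXPENSIVE_IDX 'lookups work with a >4GB .idx' '
    # the basic lookup from your example
    blob=$(echo 0 | git hash-object --stdin) &&
    echo 0 >expect &&
    git cat-file blob $blob >actual &&
    test_cmp expect actual &&

    # show-index prints entries in index (OID-sorted) order, so the
    # last line is the largest OID, whose table entries sit at the
    # very end of the .idx, past the 4GB mark
    last=$(git show-index <.git/objects/pack/*.idx |
      tail -n 1 | cut -d" " -f2) &&
    git cat-file -e $last
  '

That would cover both your basic lookup and the largest-OID case in
one place.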
It would be helpful to also write a multi-pack-index on top of this
.idx to ensure we can handle that case, too.
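Something along these lines, perhaps (again untested; just sketching
with the existing multi-pack-index builtin):

  # in the same >4GB-.idx repo as above
  git multi-pack-index write &&
  git multi-pack-index verify &&
  # $last is the largest-OID entry found in the sketch above
  git cat-file -e $last

Thanks,
-Stolee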