On 11/15/2020 11:10 PM, Jeff King wrote:
> On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:
>
>> On 13.11.2020 06:06, Jeff King wrote:
>>> I recently ran into a case where Git could not read the pack it had
>>> produced via running "git repack". The culprit turned out to be an .idx
>>> file which crossed the 4GB barrier (in bytes, not number of objects).
>>> This series fixes the problems I saw, along with similar ones I
>>> couldn't trigger in practice, and protects the .idx loading code
>>> against integer overflows that would fool the size checks.
>>
>> Would it be feasible to have a test case for this large index case?
>> This should very certainly have an EXPENSIVE tag, or might even not
>> yet work on windows. But hopefully someday I'll find some more time
>> to push large object support on windows forward, and these kind of
>> tests would really help then.
>
> I think it would be a level beyond what we usually consider even for
> EXPENSIVE. The cheapest I could come up with to generate the case is:

I agree that the cost of this test is more than I would expect for
EXPENSIVE.

>   perl -e '
>     for (0..154_000_000) {
>       print "blob\n";
>       print "data <<EOF\n";
>       print "$_\n";
>       print "EOF\n";
>     }
>   ' |
>   git fast-import
>
> which took almost 13 minutes of CPU to run, and peaked around 15GB of
> RAM (and takes about 6.7GB on disk).

I was thinking that maybe the RAM requirements would be lower if we
batched the fast-import calls and then repacked, but then the repack
would probably be just as expensive.

> In the resulting repo, the old code barfed on lookups:
>
>   $ blob=$(echo 0 | git hash-object --stdin)
>   $ git cat-file blob $blob
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file
>
> whereas now it works:
>
>   $ git cat-file blob $blob
>   0
>
> That's the most basic test I think you could do. More interesting is
> looking at entries that are actually after the 4GB mark. That requires
> dumping the whole index:
>
>   final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
>   git cat-file blob $final

Could you also (after running the test once) determine the largest
SHA-1, at least up to a unique short SHA-1? Then run something like

  git cat-file blob fffffe

Since your loop is hard-coded, you could even use the largest full
SHA-1. Naturally, nothing short of a full .idx verification would be
completely sound, and we are already generating an enormous repo.

> So I dunno. I wouldn't be opposed to codifying some of that in a
> script, but I can't imagine anybody ever running it unless they were
> working on this specific problem.

It would be good to have this available somewhere in the codebase to
run whenever testing .idx changes. Perhaps create a new prerequisite
specifically for EXPENSIVE_IDX tests, triggered only by a GIT_TEST_*
environment variable?
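To make that concrete, here is a rough sketch of what I am imagining
(completely untested, and the EXPENSIVE_IDX prereq and the
GIT_TEST_EXPENSIVE_IDX variable are names I am making up here):

  # Assumes the repo from your fast-import recipe above has already
  # been created; generating it inside the test would work too, at
  # the cost of those ~13 minutes and ~15GB of RAM.
  test_lazy_prereq EXPENSIVE_IDX 'test -n "$GIT_TEST_EXPENSIVE_IDX"'

  test_expect_success EXPENSIVE_IDX 'lookups work with a >4GB .idx' '
    # the basic lookup from your example
    blob=$(echo 0 | git hash-object --stdin) &&
    echo 0 >expect &&
    git cat-file blob $blob >actual &&
    test_cmp expect actual &&

    # show-index prints entries in index (OID-sorted) order, so the
    # last line is the largest OID, whose table entries sit at the
    # very end of the .idx, past the 4GB mark
    last=$(git show-index <.git/objects/pack/*.idx |
      tail -n 1 | cut -d" " -f2) &&
    git cat-file -e $last
  '

That would cover both your basic lookup and the largest-OID case in
one place.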
It would be helpful to also write a multi-pack-index on top of this
.idx to ensure we can handle that case, too.
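Something along these lines, perhaps (again untested; just sketching
with the existing multi-pack-index builtin):

  # in the same >4GB-.idx repo as above
  git multi-pack-index write &&
  git multi-pack-index verify &&
  # $last is the largest-OID entry found in the sketch above
  git cat-file -e $last

Thanks,
-Stolee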