On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:

> On 13.11.2020 06:06, Jeff King wrote:
> > I recently ran into a case where Git could not read the pack it had
> > produced via running "git repack". The culprit turned out to be an .idx
> > file which crossed the 4GB barrier (in bytes, not number of objects).
> > This series fixes the problems I saw, along with similar ones I couldn't
> > trigger in practice, and protects the .idx loading code against integer
> > overflows that would fool the size checks.
>
> Would it be feasible to have a test case for this large index case? This
> should very certainly have an EXPENSIVE tag, or might even not yet work
> on windows. But hopefully someday I'll find some more time to push large
> object support on windows forward, and these kind of tests would really
> help then.

I think it would be a level beyond what we usually consider even for
EXPENSIVE. The cheapest way I could come up with to generate the case
is:

  perl -e '
    for (0..154_000_000) {
      print "blob\n";
      print "data <<EOF\n";
      print "$_\n";
      print "EOF\n";
    }
  ' | git fast-import

which took almost 13 minutes of CPU to run, and peaked around 15GB of
RAM (and takes about 6.7GB on disk).
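As a sanity check on the object count (my own arithmetic, not part of
the series): a v2 .idx spends 20 bytes of sha1, 4 bytes of CRC32, and a
4-byte offset per object, so ~154 million objects is just enough to
push the file past 2^32 bytes, even before counting the 8-byte
large-offset entries some of these objects would also need:

```shell
# Back-of-the-envelope v2 .idx size; large-offset table only pushes it higher.
objects=154000001            # 0..154_000_000 inclusive
header=8                     # 4-byte magic + 4-byte version
fanout=$(( 256 * 4 ))        # fan-out table
per_entry=$(( 20 + 4 + 4 ))  # sha1 + CRC32 + 4-byte offset
trailer=$(( 20 + 20 ))       # pack checksum + idx checksum
bytes=$(( header + fanout + objects * per_entry + trailer ))
echo "$bytes"                # 4312001100, just past 2^32 = 4294967296
```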
In the resulting repo, the old code barfed on lookups:

  $ blob=$(echo 0 | git hash-object --stdin)
  $ git cat-file blob $blob
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file

whereas now it works:

  $ git cat-file blob $blob
  0

That's the most basic test I think you could do. More interesting is
looking at entries that are actually after the 4GB mark. That requires
dumping the whole index:

  final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
  git cat-file blob $final

That takes ~35s to run. Curiously, it also allocates 5GB of heap. For
some reason it decides to make an internal copy of the entries table. I
guess that's because it reads the file sequentially rather than
mmap-ing it, and 64-bit offsets in v2 idx files can't be resolved until
we've read the whole entry table (and it wants to output the entries in
sha1 order).

The checksum bug requires running git-fsck on the repo. That's another
5 minutes of CPU (and even higher peak memory; I think we create a
"struct blob" for each one, and it seems to hit 20GB).

Hitting the other cases that I fixed but never triggered in practice
would need a repo about 4x as large. So figure an hour of CPU and 60GB
of RAM.

So I dunno. I wouldn't be opposed to codifying some of that in a
script, but I can't imagine anybody ever running it unless they were
working on this specific problem.
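For what it's worth, the ~5GB heap peak during show-index is consistent
with a flat in-memory copy of the entry table, assuming roughly 32
bytes per entry (20-byte sha1 plus an 8-byte offset and 4-byte CRC with
alignment padding; that per-entry size is my guess, not taken from the
code):

```shell
# Rough heap estimate for an in-memory copy of the entry table.
objects=154000001
per_entry=32     # guessed struct size, not measured
heap=$(( objects * per_entry ))
echo "$heap"     # 4928000032, i.e. roughly the 5GB peak observed
```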
-Peff