On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:

> On 13.11.2020 06:06, Jeff King wrote:
> > I recently ran into a case where Git could not read the pack it had
> > produced via running "git repack". The culprit turned out to be an .idx
> > file which crossed the 4GB barrier (in bytes, not number of objects).
> > This series fixes the problems I saw, along with similar ones I couldn't
> > trigger in practice, and protects the .idx loading code against integer
> > overflows that would fool the size checks.
>
> Would it be feasible to have a test case for this large index case? This
> should very certainly have an EXPENSIVE tag, or might even not yet work
> on windows. But hopefully someday I'll find some more time to push large
> object support on windows forward, and these kind of tests would really
> help then.

I think it would be a level beyond what we usually consider even for
EXPENSIVE. The cheapest way I could come up with to generate the case
is:

  perl -e '
    for (0..154_000_000) {
      print "blob\n";
      print "data <<EOF\n";
      print "$_\n";
      print "EOF\n";
    }
  ' | git fast-import

which took almost 13 minutes of CPU to run, and peaked around 15GB of
RAM (and takes about 6.7GB on disk).
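As a sanity check on the object count (my own arithmetic, not part of
the series): a v2 .idx spends 20 bytes of sha1, 4 bytes of CRC32, and a
4-byte offset per object, so ~154 million objects is just enough to
push the file past 2^32 bytes, even before counting the 8-byte
large-offset entries some of these objects would also need:

```shell
# Back-of-the-envelope v2 .idx size; large-offset table only pushes it higher.
objects=154000001            # 0..154_000_000 inclusive
header=8                     # 4-byte magic + 4-byte version
fanout=$(( 256 * 4 ))        # fan-out table
per_entry=$(( 20 + 4 + 4 ))  # sha1 + CRC32 + 4-byte offset
trailer=$(( 20 + 20 ))       # pack checksum + idx checksum
bytes=$(( header + fanout + objects * per_entry + trailer ))
echo "$bytes"                # 4312001100, just past 2^32 = 4294967296
```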
In the resulting repo, the old code barfed on lookups:

  $ blob=$(echo 0 | git hash-object --stdin)
  $ git cat-file blob $blob
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file

whereas now it works:

  $ git cat-file blob $blob
  0

That's the most basic test I think you could do. More interesting is
looking at entries that are actually after the 4GB mark. That requires
dumping the whole index:

  final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
  git cat-file blob $final

That takes ~35s to run. Curiously, it also allocates 5GB of heap. For
some reason it decides to make an internal copy of the entries table. I
guess that's because it reads the file sequentially rather than
mmap-ing it, and 64-bit offsets in v2 idx files can't be resolved until
we've read the whole entry table (and it wants to output the entries in
sha1 order).

The checksum bug requires running git-fsck on the repo. That's another
5 minutes of CPU (and even higher peak memory; I think we create a
"struct blob" for each one, and it seems to hit 20GB).

Hitting the other cases that I fixed but never triggered in practice
would need a repo about 4x as large. So figure an hour of CPU and 60GB
of RAM.

So I dunno. I wouldn't be opposed to codifying some of that in a
script, but I can't imagine anybody ever running it unless they were
working on this specific problem.
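For what it's worth, the ~5GB heap peak during show-index is consistent
with a flat in-memory copy of the entry table, assuming roughly 32
bytes per entry (20-byte sha1 plus an 8-byte offset and 4-byte CRC with
alignment padding; that per-entry size is my guess, not taken from the
code):

```shell
# Rough heap estimate for an in-memory copy of the entry table.
objects=154000001
per_entry=32     # guessed struct size, not measured
heap=$(( objects * per_entry ))
echo "$heap"     # 4928000032, i.e. roughly the 5GB peak observed
```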
-Peff