Re: git status OOM on mmap of large file

On Tue, Jan 22, 2019 at 06:07:14PM -0400, Joey Hess wrote:

> joey@darkstar:~/tmp/t> ls -l big-file
> -rw-r--r-- 1 joey joey 11811160064 Jan 22 17:48 big-file
> joey@darkstar:~/tmp/t> git status
> fatal: Out of memory, realloc failed
> 
> This file is checked into git, but using a smudge/clean filter, so the actual
> data checked into git is a hash. I did so using git-annex v7 mode, but I
> suppose git lfs would cause the same problem.
> 
> [pid  6573] mmap(NULL, 11811164160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> 
> Why status needs to mmap a large file that is not modified
> and that is configured to pass through smudge/clean, I don't know.
> It seems like it should be possible for status to work in this situation.

One minor point: I don't think this is us mmap-ing the file. The
descriptor is -1, and Git never uses PROT_WRITE. This is likely your
libc using mmap to fulfill a malloc() request.
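
You can see the same thing outside of Git: glibc serves any malloc()
bigger than its MMAP_THRESHOLD (128kB by default) with an anonymous
mmap. Here's a standalone illustration (not Git code); run it under
strace and you'll see the same mmap(NULL, ..., MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) call:

  #include <stdlib.h>

  int main(void)
  {
	/*
	 * Well above MMAP_THRESHOLD, so glibc bypasses the heap and
	 * asks the kernel for an anonymous mapping instead.
	 */
	char *p = malloc((size_t)1 << 30);
	if (!p)
		return 1;
	p[0] = 1;  /* touch a byte so a page actually gets faulted in */
	free(p);   /* large mappings are munmap()ed immediately */
	return 0;
  }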

That said, it just turns the question into: why did Git try to malloc
that many bytes?  If I reproduce your example (using a 100MB file) and
set GIT_ALLOC_LIMIT in the environment, the backtrace to die() is:

  #1  0x0000555555786d65 in memory_limit_check (size=104857601, gentle=0) at wrapper.c:27
  #2  0x0000555555787084 in xrealloc (ptr=0x0, size=104857601) at wrapper.c:137
  #3  0x000055555575612e in strbuf_grow (sb=0x7fffffffdbf0, extra=104857600) at strbuf.c:98
  #4  0x000055555575731a in strbuf_read (sb=0x7fffffffdbf0, fd=4, hint=104857600) at strbuf.c:429
  #5  0x0000555555664a1f in apply_single_file_filter (path=0x5555558c787c "foo.rand", ...)
  #6  0x0000555555665321 in apply_filter (path=0x5555558c787c "foo.rand", ...)
  ...
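
For reference, the check that fires is memory_limit_check() in
wrapper.c. From memory it looks roughly like this (exact messages
aside): it parses GIT_ALLOC_LIMIT once and dies when any single
allocation request exceeds it:

  static int memory_limit_check(size_t size, int gentle)
  {
	static size_t limit = 0;
	if (!limit) {
		limit = git_env_ulong("GIT_ALLOC_LIMIT", 0);
		if (!limit)
			limit = SIZE_MAX;
	}
	if (size > limit) {
		if (gentle) {
			error("attempting to allocate %"PRIuMAX" over limit %"PRIuMAX,
			      (uintmax_t)size, (uintmax_t)limit);
			return -1;
		}
		die("attempting to allocate %"PRIuMAX" over limit %"PRIuMAX,
		    (uintmax_t)size, (uintmax_t)limit);
	}
	return 0;
  }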

Looking at apply_single_file_filter(), it's not the _original_ file that
it's trying to store, but rather the data coming back from the filter.
It's just that we use the original file size as a hint!

In this case (and I'd venture to say in most gigantic-file cases) that
hint is much larger than we need, to the point of causing a problem.
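
The hint hurts because strbuf_read() allocates the whole thing up
front, before it has read a single byte from the filter. Paraphrasing
strbuf.c from memory (error-path bookkeeping trimmed):

  ssize_t strbuf_read(struct strbuf *sb, int fd, size_t hint)
  {
	size_t oldlen = sb->len;

	strbuf_grow(sb, hint ? hint : 8192);  /* the whole hint, up front */
	for (;;) {
		ssize_t want = sb->alloc - sb->len - 1;
		ssize_t got = read_in_full(fd, sb->buf + sb->len, want);

		if (got < 0)
			return -1;  /* the real code also restores oldlen */
		sb->len += got;
		if (got < want)
			break;  /* short read means EOF */
		strbuf_grow(sb, 8192);  /* grow incrementally past the hint */
	}

	sb->buf[sb->len] = '\0';
	return sb->len - oldlen;
  }

So with your 11GB file, we ask for 11GB before the filter has produced
any output at all.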

In other words, I think this patch fixes your problem:

diff --git a/convert.c b/convert.c
index 0d89ae7c23..85aebe2ed3 100644
--- a/convert.c
+++ b/convert.c
@@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
 	if (start_async(&async))
 		return 0;	/* error was already reported */
 
-	if (strbuf_read(&nbuf, async.out, len) < 0) {
+	if (strbuf_read(&nbuf, async.out, 0) < 0) {
 		err = error(_("read from external filter '%s' failed"), cmd);
 	}
 	if (close(async.out)) {

though possibly we should actually continue to use the file size as a
hint up to a certain point, which avoids reallocations for more "normal"
filters where the input and output sizes are in the same ballpark.

Just off the top of my head, something like:

  /* guess that the filtered output will be the same size as the original */
  hint = len;

  /* allocate 10% extra in case the clean size is slightly larger */
  hint += hint / 10;

  /*
   * In any case, never go higher than half of core.bigFileThreshold.
   * We'd like to avoid allocating more bytes than that, and that still
   * gives us room for our strbuf to preemptively double if our guess is
   * just a little on the low side.
   */
  if (hint > big_file_threshold / 2)
	hint = big_file_threshold / 2;

But to be honest, I have no idea if that would even produce measurable
benefits over simply growing the strbuf from scratch (i.e., hint==0).
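
For scale: strbuf_grow() grows geometrically via the alloc_nr() macro,
(x+16)*3/2, so even starting from nothing it takes only a few dozen
reallocations to reach the sizes we're talking about. A quick
back-of-the-envelope (standalone, not Git code):

  #include <stdio.h>

  #define alloc_nr(x) (((x)+16)*3/2)  /* Git's growth step */

  int main(void)
  {
	size_t alloc = 0, steps = 0;

	/* count growth steps to reach the 100MB from the example above */
	while (alloc < 100 * 1024 * 1024) {
		alloc = alloc_nr(alloc);
		steps++;
	}
	printf("%zu steps to reach %zu bytes\n", steps, alloc);
	return 0;
  }

That suggests the cost of hint==0 is mostly the memcpy traffic from the
intermediate buffers, not the realloc calls themselves.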

-Peff


