On Tue, Jan 22, 2019 at 06:07:14PM -0400, Joey Hess wrote:

> joey@darkstar:~/tmp/t> ls -l big-file
> -rw-r--r-- 1 joey joey 11811160064 Jan 22 17:48 big-file
> joey@darkstar:~/tmp/t> git status
> fatal: Out of memory, realloc failed
>
> This file is checked into git, but using a smudge/clean filter, so the
> actual data checked into git is a hash. I did so using git-annex v7
> mode, but I suppose git lfs would cause the same problem.
>
> [pid 6573] mmap(NULL, 11811164160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
>
> Why status needs to mmap a large file that is not modified and that is
> configured to pass through smudge/clean, I don't know. It seems like
> it should be possible for status to work in this situation.

One minor point: I don't think this is us mmap-ing the file. The
descriptor is -1 and the flags include MAP_ANONYMOUS, so no file is
being mapped at all (and Git never maps files with PROT_WRITE). This is
likely your libc using mmap to fulfill a large malloc() request. That
said, that just turns the question into: why did Git try to malloc that
many bytes?

If I reproduce your example (using a 100MB file) and set GIT_ALLOC_LIMIT
in the environment, the backtrace to die() is:

  #1 0x0000555555786d65 in memory_limit_check (size=104857601, gentle=0) at wrapper.c:27
  #2 0x0000555555787084 in xrealloc (ptr=0x0, size=104857601) at wrapper.c:137
  #3 0x000055555575612e in strbuf_grow (sb=0x7fffffffdbf0, extra=104857600) at strbuf.c:98
  #4 0x000055555575731a in strbuf_read (sb=0x7fffffffdbf0, fd=4, hint=104857600) at strbuf.c:429
  #5 0x0000555555664a1f in apply_single_file_filter (path=0x5555558c787c "foo.rand", ...)
  #6 0x0000555555665321 in apply_filter (path=0x5555558c787c "foo.rand", ...)
  ...

Looking at apply_single_file_filter(), it's not the _original_ file that
it's trying to store, but rather the data coming back from the filter.
It's just that we use the original file size as the allocation hint! In
this case (and I'd venture to say in most gigantic-file cases) that hint
is much larger than we need, to the point of causing a problem.

In other words, I think this patch fixes your problem:

diff --git a/convert.c b/convert.c
index 0d89ae7c23..85aebe2ed3 100644
--- a/convert.c
+++ b/convert.c
@@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
 	if (start_async(&async))
 		return 0;	/* error was already reported */
 
-	if (strbuf_read(&nbuf, async.out, len) < 0) {
+	if (strbuf_read(&nbuf, async.out, 0) < 0) {
 		err = error(_("read from external filter '%s' failed"), cmd);
 	}
 	if (close(async.out)) {

though possibly we should continue to use the file size as a hint up to
a certain point, to avoid reallocations for more "normal" filters where
the input and output sizes are in the same ballpark. Just off the top of
my head, something like:

	/* guess that the filtered output will be the same size as the original */
	hint = len;

	/* allocate 10% extra in case the clean size is slightly larger */
	hint *= 1.1;

	/*
	 * In any case, never go higher than half of core.bigFileThreshold.
	 * We'd like to avoid allocating more bytes than that, and it still
	 * gives the strbuf room to preemptively double if our guess is
	 * just a little on the low side.
	 */
	if (hint > big_file_threshold / 2)
		hint = big_file_threshold / 2;

But to be honest, I have no idea whether that would even produce
measurable benefits over simply growing the strbuf from scratch (i.e.,
hint == 0).

-Peff
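P.S. A tangent, but if you want to see the malloc-to-mmap fallback in
isolation, a toy program like the one below shows it (this is not Git
code; the 11GB figure just mirrors Joey's file size). On a 64-bit glibc
system, running it under "strace -e trace=mmap" shows an anonymous
PROT_READ|PROT_WRITE mapping with fd == -1, the same signature as the
failing call in the report:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		/* a request this large goes straight to mmap(MAP_ANONYMOUS) in glibc */
		size_t size = 11811160064ULL; /* ~11GB, like the file in the report */
		char *buf = malloc(size);

		if (!buf) {
			/* the same ENOMEM that Git reports via its xrealloc wrapper */
			perror("malloc");
			return 1;
		}
		memset(buf, 0, size); /* touch the pages so the allocation is real */
		free(buf);
		return 0;
	}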
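P.P.S. For what it's worth, the reason hint == 0 isn't a disaster is
that strbuf_grow() expands the buffer geometrically (via ALLOC_GROW), so
reading N bytes from the filter costs only O(log N) reallocations. A
standalone sketch of that growth policy (an analog, not the actual
strbuf code; the grow() helper here approximates alloc_nr()):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	/* growth policy in the spirit of Git's alloc_nr(): ((x + 16) * 3 / 2) */
	static size_t grow(size_t alloc, size_t need)
	{
		while (alloc < need)
			alloc = (alloc + 16) * 3 / 2;
		return alloc;
	}

	int main(void)
	{
		size_t len = 0, alloc = 0, reallocs = 0;
		size_t total = 100 * 1024 * 1024; /* pretend the filter emits 100MB */
		char *buf = NULL;

		while (len < total) {
			size_t chunk = 8192; /* fixed read size per iteration */
			if (len + chunk > alloc) {
				char *tmp;
				alloc = grow(alloc, len + chunk);
				tmp = realloc(buf, alloc);
				if (!tmp) {
					free(buf);
					return 1;
				}
				buf = tmp;
				reallocs++;
			}
			memset(buf + len, 'x', chunk); /* stand-in for read() */
			len += chunk;
		}
		printf("%zu bytes in %zu reallocs\n", len, reallocs);
		free(buf);
		return 0;
	}

For a 100MB stream that comes out to a few dozen reallocs, which is
presumably in the noise next to running the filter itself.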