On Thu, Jan 24, 2019 at 02:38:10PM -0400, Joey Hess wrote:

> > Just off the top of my head, something like:
> >
> >   /* guess that the filtered output will be the same size as the original */
> >   hint = len;
> >
> >   /* allocate 10% extra in case the clean size is slightly larger */
> >   hint *= 1.1;
> >
> >   /*
> >    * in any case, never go higher than half of core.bigFileThreshold.
> >    * We'd like to avoid allocating more bytes than that, and that still
> >    * gives us room for our strbuf to preemptively double if our guess is
> >    * just a little on the low side.
> >    */
> >   if (hint > big_file_threshold / 2)
> >           hint = big_file_threshold / 2;
> >
> > But to be honest, I have no idea if that would even produce measurable
> > benefits over simply growing the strbuf from scratch (i.e., hint==0).
>
> Half of 512 MB is still quite a lot of memory to default to using in
> this situation. Eg smaller VPS's still often only have a GB or two of
> ram.

I think you'd want to drop core.bigFileThreshold on such a server, just
because Git will happily keep 2*(bigFileThreshold-1) in memory to do a
diff. But that nit aside...

> I did some benchmarking, using cat as the clean filter:
> [...]
> From this, it looks like the file has to be quite large before the
> preallocation makes a sizable improvement to runtime, and the
> smudge/clean filters have to be used for actual content filtering
> (not for hash generation purposes as git-annex and git-lfs use it).
> An unusual edge case I think. So hint == 0 seems fine.

Thanks for these timings! I agree that "hint == 0" is probably
reasonable, then.

I suppose there's no reason not to proceed with a patch around this. For
most cases it's really only half the solution (since smudging is going
to run into the same problem). But fixing that is quite a bit more
involved, and the change itself will be largely orthogonal.

-Peff
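
P.S. For anyone following along, here's a rough standalone sketch of why
"hint == 0" is usually fine. This is not git's actual strbuf code (that
lives in strbuf.[ch]); the struct and function names here are made up
for illustration. The point is just that a doubling buffer keeps appends
amortized O(n), so starting from an empty allocation costs only a
handful of extra reallocs:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* toy stand-in for git's strbuf */
  struct buf {
          char *data;
          size_t len, alloc;
  };

  static void buf_grow(struct buf *b, size_t extra)
  {
          size_t want = b->len + extra;
          if (want <= b->alloc)
                  return;
          if (!b->alloc)
                  b->alloc = 64;
          while (b->alloc < want)
                  b->alloc *= 2; /* doubling keeps total copying linear */
          b->data = realloc(b->data, b->alloc);
          if (!b->data)
                  exit(1);
  }

  static void buf_add(struct buf *b, const void *p, size_t n)
  {
          buf_grow(b, n);
          memcpy(b->data + b->len, p, n);
          b->len += n;
  }

  int main(void)
  {
          struct buf b = { NULL, 0, 0 }; /* hint == 0: start empty */
          static char chunk[8192];
          size_t i;

          /* append 100MB in pipe-sized chunks; only ~15 reallocs happen */
          for (i = 0; i < (100 << 20) / sizeof(chunk); i++)
                  buf_add(&b, chunk, sizeof(chunk));
          printf("len=%zu alloc=%zu\n", b.len, b.alloc);
          free(b.data);
          return 0;
  }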