On Thu, Jan 24, 2019 at 02:38:10PM -0400, Joey Hess wrote:

> > Just off the top of my head, something like:
> >
> >   /* guess that the filtered output will be the same size as the original */
> >   hint = len;
> >
> >   /* allocate 10% extra in case the clean size is slightly larger */
> >   hint *= 1.1;
> >
> >   /*
> >    * in any case, never go higher than half of core.bigFileThreshold.
> >    * We'd like to avoid allocating more bytes than that, and that still
> >    * gives us room for our strbuf to preemptively double if our guess is
> >    * just a little on the low side.
> >    */
> >   if (hint > big_file_threshold / 2)
> >           hint = big_file_threshold / 2;
> >
> > But to be honest, I have no idea if that would even produce measurable
> > benefits over simply growing the strbuf from scratch (i.e., hint==0).
>
> Half of 512 MB is still quite a lot of memory to default to using in
> this situation. Eg smaller VPS's still often only have a GB or two of
> ram.

I think you'd want to drop core.bigFileThreshold on such a server, just
because Git will happily keep 2*(bigFileThreshold-1) in memory to do a
diff. But that nit aside...

> I did some benchmarking, using cat as the clean filter:
> [...]
> From this, it looks like the file has to be quite large before the
> preallocation makes a sizable improvement to runtime, and the
> smudge/clean filters have to be used for actual content filtering
> (not for hash generation purposes as git-annex and git-lfs use it).
> An unusual edge case I think. So hint == 0 seems fine.

Thanks for these timings! I agree that "hint == 0" is probably
reasonable, then.

I suppose there's no reason not to proceed with a patch around this. For
most cases it's really only half the solution (since smudging is going
to run into the same problem). But fixing that is quite a bit more
involved, and the change itself will be largely orthogonal.

-Peff
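
P.S. For anyone following along, here's a rough standalone sketch of why
"hint == 0" is usually fine. This is not git's actual strbuf code (that
lives in strbuf.[ch]); the struct and function names here are made up
for illustration. The point is just that a doubling buffer keeps appends
amortized O(n), so starting from an empty allocation costs only a
handful of extra reallocs:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* toy stand-in for git's strbuf */
  struct buf {
          char *data;
          size_t len, alloc;
  };

  static void buf_grow(struct buf *b, size_t extra)
  {
          size_t want = b->len + extra;
          if (want <= b->alloc)
                  return;
          if (!b->alloc)
                  b->alloc = 64;
          while (b->alloc < want)
                  b->alloc *= 2; /* doubling keeps total copying linear */
          b->data = realloc(b->data, b->alloc);
          if (!b->data)
                  exit(1);
  }

  static void buf_add(struct buf *b, const void *p, size_t n)
  {
          buf_grow(b, n);
          memcpy(b->data + b->len, p, n);
          b->len += n;
  }

  int main(void)
  {
          struct buf b = { NULL, 0, 0 }; /* hint == 0: start empty */
          static char chunk[8192];
          size_t i;

          /* append 100MB in pipe-sized chunks; only ~15 reallocs happen */
          for (i = 0; i < (100 << 20) / sizeof(chunk); i++)
                  buf_add(&b, chunk, sizeof(chunk));
          printf("len=%zu alloc=%zu\n", b.len, b.alloc);
          free(b.data);
          return 0;
  }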