Jeff King wrote:
> Looking at apply_single_file_filter(), it's not the _original_ file that
> it's trying to store, but rather the data coming back from the filter.
> It's just that we use the original file size as a hint!

Thanks much for working that out!

> In other words, I think this patch fixes your problem:
>
> diff --git a/convert.c b/convert.c
> index 0d89ae7c23..85aebe2ed3 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
>  	if (start_async(&async))
>  		return 0;	/* error was already reported */
>
> -	if (strbuf_read(&nbuf, async.out, len) < 0) {
> +	if (strbuf_read(&nbuf, async.out, 0) < 0) {
>  		err = error(_("read from external filter '%s' failed"), cmd);
>  	}
>  	if (close(async.out)) {

Yes, confirmed that does fix it.

> though possibly we should actually continue to use the file size as a
> hint up to a certain point, which avoids reallocations for more "normal"
> filters where the input and output sizes are in the same ballpark.
>
> Just off the top of my head, something like:
>
>   /* guess that the filtered output will be the same size as the original */
>   hint = len;
>
>   /* allocate 10% extra in case the clean size is slightly larger */
>   hint *= 1.1;
>
>   /*
>    * in any case, never go higher than half of core.bigfileThreshold.
>    * We'd like to avoid allocating more bytes than that, and that still
>    * gives us room for our strbuf to preemptively double if our guess is
>    * just a little on the low side.
>    */
>   if (hint > big_file_threshold / 2)
>       hint = big_file_threshold / 2;
>
> But to be honest, I have no idea if that would even produce measurable
> benefits over simply growing the strbuf from scratch (i.e., hint==0).

Half of 512 MB is still quite a lot of memory to default to using in this
situation; e.g. smaller VPSes still often have only a GB or two of RAM.

When the clean filter is being used in a way that doesn't involve hashing
large files, it will mostly be operating on typically sized source code
files. So capping the maximum hint size at around the size of a typical
source code file would be plenty for both common uses of the clean filter.
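Something like this is what I have in mind -- a rough, untested sketch;
FILTER_HINT_MAX and filter_read_hint() are made-up names, and 64 kB is
only a guess at "a typical source code file":

#include <stddef.h>

/* made-up cap: roughly the size of a typical source code file */
#define FILTER_HINT_MAX ((size_t)64 * 1024)

static size_t filter_read_hint(size_t len)
{
	size_t hint;

	/* large inputs never get more than the cap preallocated */
	if (len >= FILTER_HINT_MAX)
		return FILTER_HINT_MAX;

	/* guess the filtered output is about the input size, plus 10% slack */
	hint = len + len / 10;	/* len is small here, so this cannot overflow */
	if (hint > FILTER_HINT_MAX)
		hint = FILTER_HINT_MAX;
	return hint;
}

apply_single_file_filter() would then pass filter_read_hint(len) to
strbuf_read() instead of passing len or 0 directly.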
I did some benchmarking, using cat as the clean filter:

git status, 32 kB file, hint == len
  time     3.865 ms   (3.829 ms .. 3.943 ms)
           0.994 R²   (0.987 R² .. 0.999 R²)
  mean     3.934 ms   (3.889 ms .. 4.021 ms)
  std dev  191.8 μs   (106.8 μs .. 291.8 μs)

git status, 32 kB file, hint == 0
  time     3.887 ms   (3.751 ms .. 4.064 ms)
           0.992 R²   (0.986 R² .. 0.998 R²)
  mean     4.002 ms   (3.931 ms .. 4.138 ms)
  std dev  292.2 μs   (189.0 μs .. 498.3 μs)

git status, 1 MB file, hint == len
  time     3.942 ms   (3.816 ms .. 4.078 ms)
           0.995 R²   (0.991 R² .. 0.999 R²)
  mean     3.969 ms   (3.916 ms .. 4.054 ms)
  std dev  220.1 μs   (155.1 μs .. 304.3 μs)

git status, 1 MB file, hint == 0
  time     3.869 ms   (3.836 ms .. 3.911 ms)
           0.998 R²   (0.995 R² .. 1.000 R²)
  mean     3.895 ms   (3.868 ms .. 3.947 ms)
  std dev  112.3 μs   (47.93 μs .. 182.7 μs)

git status, 1 GB file, hint == len
  time     7.173 s    (6.834 s .. 7.564 s)
           0.999 R²   (0.999 R² .. 1.000 R²)
  mean     7.560 s    (7.369 s .. 7.903 s)
  std dev  333.2 ms   (27.65 ms .. 412.2 ms)

git status, 1 GB file, hint == 0
  time     7.652 s    (6.307 s .. 8.263 s)
           0.996 R²   (0.992 R² .. 1.000 R²)
  mean     8.082 s    (7.843 s .. 8.202 s)
  std dev  232.3 ms   (2.362 ms .. 277.1 ms)

From this, it looks like the file has to be quite large before the
preallocation makes a sizable improvement to runtime, and the smudge/clean
filters have to be used for actual content filtering (not for hash
generation, as git-annex and git-lfs use them). An unusual edge case, I
think. So hint == 0 seems fine.

--
see shy jo