On Fri, Dec 03 2021, Han Xin wrote:

> From: Han Xin <hanxin.hx@xxxxxxxxxxxxxxx>
>
> We used to call "get_data()" in "unpack_non_delta_entry()" to read the
> entire contents of a blob object, no matter how big it is. This
> implementation may consume all the memory and cause OOM.
>
> By implementing a zstream version of the input_stream interface, we can
> use a small fixed buffer for "unpack_non_delta_entry()".
>
> However, unpacking non-delta objects from a stream instead of from an
> entire buffer has a 10% performance penalty. Therefore, only unpack
> objects larger than the "big_file_threshold" in zstream. See the
> following benchmarks:
>
>     hyperfine \
>       --setup \
>       'if ! test -d scalar.git; then git clone --bare https://github.com/microsoft/scalar.git; cp scalar.git/objects/pack/*.pack small.pack; fi' \
>       --prepare 'rm -rf dest.git && git init --bare dest.git' \
>       -n 'old' 'git -C dest.git unpack-objects <small.pack' \
>       -n 'new' 'new/git -C dest.git unpack-objects <small.pack' \
>       -n 'new (small threshold)' \
>       'new/git -c core.bigfilethreshold=16k -C dest.git unpack-objects <small.pack'
>
>     Benchmark 1: old
>       Time (mean ± σ):      6.075 s ±  0.069 s    [User: 5.047 s, System: 0.991 s]
>       Range (min … max):    6.018 s …  6.189 s    10 runs
>
>     Benchmark 2: new
>       Time (mean ± σ):      6.090 s ±  0.033 s    [User: 5.075 s, System: 0.976 s]
>       Range (min … max):    6.030 s …  6.142 s    10 runs
>
>     Benchmark 3: new (small threshold)
>       Time (mean ± σ):      6.755 s ±  0.029 s    [User: 5.150 s, System: 1.560 s]
>       Range (min … max):    6.711 s …  6.809 s    10 runs
>
>     Summary
>       'old' ran
>         1.00 ± 0.01 times faster than 'new'
>         1.11 ± 0.01 times faster than 'new (small threshold)'

So before this change we used core.bigfilethreshold for two things (or
more?): whether we show a diff for it (we mark it "binary"), and whether
it's split into a loose object.

Now it's three things: we've added a "this is the threshold at which
we'll stream the object" to that list.

Might it make sense to squash something like this in, so we can have our
cake & eat it too?

With this I get, where HEAD~0 is this change:

Summary
  './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'HEAD~0' ran
    1.00 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'HEAD~1'
    1.00 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'origin/master'
    1.01 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'HEAD~0'
    1.06 ± 0.14 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'origin/master'
    1.20 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'HEAD~1'

I.e. it's 5% slower, not 20% (I haven't looked into why), but we won't
stream out 16k..128MB objects (maybe the repo has even bigger ones?).
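(Not the exact invocation behind the numbers above, but a sketch of how
a comparison of that shape could be reproduced with hyperfine's
parameter lists. It assumes each of the three revisions has first been
built into its own binary; the ./git-{rev} names are made up for the
example:

hyperfine \
	--setup 'if ! test -d scalar.git; then git clone --bare https://github.com/microsoft/scalar.git; cp scalar.git/objects/pack/*.pack small.pack; fi' \
	--prepare 'rm -rf dest.git && git init --bare dest.git' \
	-L rev master,HEAD-1,HEAD-0 \
	-L threshold 512m,16k \
	'./git-{rev} -C dest.git -c core.bigfilethreshold={threshold} unpack-objects <small.pack'

hyperfine then runs the cross product of the two parameter lists, i.e.
all six revision/threshold combinations in the summary above.)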
diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index c04f62a54a1..601b7a2418f 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -424,6 +424,17 @@ be delta compressed, but larger binary media files won't be.
 +
 Common unit suffixes of 'k', 'm', or 'g' are supported.
 
+core.bigFileStreamingThreshold::
+	Files larger than this will be streamed out to a temporary
+	object file while being hashed, which will then be renamed
+	in-place to a loose object, particularly if the
+	`core.bigFileThreshold` setting dictates that they're always
+	written out as loose objects.
++
+Default is 128 MiB on all platforms.
++
+Common unit suffixes of 'k', 'm', or 'g' are supported.
+
 core.excludesFile::
 	Specifies the pathname to the file that contains patterns to
 	describe paths that are not meant to be tracked, in addition
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index bedc494e2db..94ce275c807 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -400,7 +400,7 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size,
 	void *buf;
 
 	/* Write large blob in stream without allocating full buffer. */
-	if (!dry_run && type == OBJ_BLOB && size > big_file_threshold) {
+	if (!dry_run && type == OBJ_BLOB && size > big_file_streaming_threshold) {
 		write_stream_blob(nr, size);
 		return;
 	}
diff --git a/cache.h b/cache.h
index eba12487b99..4037c7fd849 100644
--- a/cache.h
+++ b/cache.h
@@ -964,6 +964,7 @@ extern size_t packed_git_window_size;
 extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
+extern unsigned long big_file_streaming_threshold;
 extern unsigned long pack_size_limit_cfg;
 
 /*
diff --git a/config.c b/config.c
index c5873f3a706..7b122a142a8 100644
--- a/config.c
+++ b/config.c
@@ -1408,6 +1408,11 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.bigfilestreamingthreshold")) {
+		big_file_streaming_threshold = git_config_ulong(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.packedgitlimit")) {
 		packed_git_limit = git_config_ulong(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 9da7f3c1a19..4fcc3de7417 100644
--- a/environment.c
+++ b/environment.c
@@ -46,6 +46,7 @@ size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
 size_t delta_base_cache_limit = 96 * 1024 * 1024;
 unsigned long big_file_threshold = 512 * 1024 * 1024;
+unsigned long big_file_streaming_threshold = 128 * 1024 * 1024;
 int pager_use_color = 1;
 const char *editor_program;
 const char *askpass_program;
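A side note on the mechanism, for anyone skimming the thread: the
write_stream_blob() being dispatched to above inflates through a small
fixed buffer instead of get_data()'s single whole-object allocation.
Purely as a standalone illustration of that general pattern, here's the
stock zlib chunked-inflate loop (this is not the code from this series;
the CHUNK size, the stdio plumbing and the function name are all made
up for the example):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 16384 /* small fixed buffer, independent of object size */

/*
 * Inflate "in" to "out" one CHUNK at a time; peak buffer use stays at
 * 2*CHUNK no matter how large the inflated result is.  Returns Z_OK on
 * success and a zlib error code otherwise.
 */
static int inflate_in_chunks(FILE *in, FILE *out)
{
	unsigned char ibuf[CHUNK], obuf[CHUNK];
	z_stream z;
	int ret = Z_OK;

	memset(&z, 0, sizeof(z));
	if (inflateInit(&z) != Z_OK)
		return Z_ERRNO;

	do {
		z.avail_in = fread(ibuf, 1, CHUNK, in);
		if (ferror(in)) {
			inflateEnd(&z);
			return Z_ERRNO;
		}
		if (!z.avail_in)
			break;
		z.next_in = ibuf;

		do {
			z.avail_out = CHUNK;
			z.next_out = obuf;
			ret = inflate(&z, Z_NO_FLUSH);
			switch (ret) {
			case Z_STREAM_ERROR:
			case Z_NEED_DICT:
			case Z_DATA_ERROR:
			case Z_MEM_ERROR:
				inflateEnd(&z);
				return ret;
			}
			/*
			 * The series hashes each chunk and streams it to a
			 * temporary object file at this point.
			 */
			fwrite(obuf, 1, CHUNK - z.avail_out, out);
		} while (!z.avail_out);
	} while (ret != Z_STREAM_END);

	inflateEnd(&z);
	return ret == Z_STREAM_END ? Z_OK : Z_DATA_ERROR;
}

The per-chunk bookkeeping and I/O is presumably where the streaming
penalty comes from; that would be consistent with the higher "System"
time in benchmark 3 above.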