Re: [PATCH v4 5/5] unpack-objects: unpack_non_delta_entry() read data in a stream

On Fri, Dec 3, 2021 at 9:19 PM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>
>
> On Fri, Dec 03 2021, Han Xin wrote:
>
> > From: Han Xin <hanxin.hx@xxxxxxxxxxxxxxx>
> >
> > We used to call "get_data()" in "unpack_non_delta_entry()" to read the
> > entire contents of a blob object, no matter how big it is. This
> > implementation may consume all available memory and cause OOM.
> >
> > By implementing a zstream version of the input_stream interface, we can
> > use a small fixed buffer for "unpack_non_delta_entry()".
> >
> > However, unpacking non-delta objects from a stream instead of from an
> > entire buffer incurs about a 10% performance penalty, so we only unpack
> > objects larger than "big_file_threshold" in zstream. See the following
> > benchmarks:
> >
> >     hyperfine \
> >       --setup \
> >       'if ! test -d scalar.git; then git clone --bare https://github.com/microsoft/scalar.git; cp scalar.git/objects/pack/*.pack small.pack; fi' \
> >       --prepare 'rm -rf dest.git && git init --bare dest.git' \
> >       -n 'old' 'git -C dest.git unpack-objects <small.pack' \
> >       -n 'new' 'new/git -C dest.git unpack-objects <small.pack' \
> >       -n 'new (small threshold)' \
> >       'new/git -c core.bigfilethreshold=16k -C dest.git unpack-objects <small.pack'
> >     Benchmark 1: old
> >       Time (mean ± σ):      6.075 s ±  0.069 s    [User: 5.047 s, System: 0.991 s]
> >       Range (min … max):    6.018 s …  6.189 s    10 runs
> >
> >     Benchmark 2: new
> >       Time (mean ± σ):      6.090 s ±  0.033 s    [User: 5.075 s, System: 0.976 s]
> >       Range (min … max):    6.030 s …  6.142 s    10 runs
> >
> >     Benchmark 3: new (small threshold)
> >       Time (mean ± σ):      6.755 s ±  0.029 s    [User: 5.150 s, System: 1.560 s]
> >       Range (min … max):    6.711 s …  6.809 s    10 runs
> >
> >     Summary
> >       'old' ran
> >         1.00 ± 0.01 times faster than 'new'
> >         1.11 ± 0.01 times faster than 'new (small threshold)'
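
For context: the "input_stream" interface mentioned above is introduced
earlier in this series. Its shape is roughly as follows (a simplified
illustration; see the actual patches for the exact definition):

  struct input_stream {
    const void *(*read)(struct input_stream *ins, unsigned long *len);
    void *data;
    int is_finished;
  };

A zstream-backed implementation of "read" inflates into a small fixed
buffer and hands back one chunk per call, so peak memory stays bounded
no matter how large the object is.
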
>
> So before, we used core.bigfilethreshold for two things (or more?):
> whether we show a diff for it (we mark it "binary") and whether it's
> split into a loose object.
>
> Now it's three things: we've added a "this is the threshold at which
> we'll stream the object" to that list.
>
> Might it make sense to squash something like this in, so we can have our
> cake & eat it too?
>
> With this I get, where HEAD~0 is this change:
>
>     Summary
>       './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'HEAD~0' ran
>         1.00 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'HEAD~1'
>         1.00 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'origin/master'
>         1.01 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'HEAD~0'
>         1.06 ± 0.14 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'origin/master'
>         1.20 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'HEAD~1'
>
> I.e. it's 5% slower, not 20% (haven't looked into why), but we'll not
> stream out 16k..128MB objects (maybe the repo has even bigger ones?)
>
> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index c04f62a54a1..601b7a2418f 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -424,6 +424,17 @@ be delta compressed, but larger binary media files won't be.
>  +
>  Common unit suffixes of 'k', 'm', or 'g' are supported.
>
> +core.bigFileStreamingThreshold::
> +       Files larger than this will be streamed out to a temporary
> +       object file while being hashed, which will then be renamed
> +       in-place to a loose object, particularly if the
> +       `core.bigFileThreshold` setting dictates that they're always
> +       written out as loose objects.
> ++
> +Default is 128 MiB on all platforms.
> ++
> +Common unit suffixes of 'k', 'm', or 'g' are supported.
> +
>  core.excludesFile::
>         Specifies the pathname to the file that contains patterns to
>         describe paths that are not meant to be tracked, in addition
> diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
> index bedc494e2db..94ce275c807 100644
> --- a/builtin/unpack-objects.c
> +++ b/builtin/unpack-objects.c
> @@ -400,7 +400,7 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size,
>         void *buf;
>
>         /* Write large blob in stream without allocating full buffer. */
> -       if (!dry_run && type == OBJ_BLOB && size > big_file_threshold) {
> +       if (!dry_run && type == OBJ_BLOB && size > big_file_streaming_threshold) {
>                 write_stream_blob(nr, size);
>                 return;
>         }
> diff --git a/cache.h b/cache.h
> index eba12487b99..4037c7fd849 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -964,6 +964,7 @@ extern size_t packed_git_window_size;
>  extern size_t packed_git_limit;
>  extern size_t delta_base_cache_limit;
>  extern unsigned long big_file_threshold;
> +extern unsigned long big_file_streaming_threshold;
>  extern unsigned long pack_size_limit_cfg;
>
>  /*
> diff --git a/config.c b/config.c
> index c5873f3a706..7b122a142a8 100644
> --- a/config.c
> +++ b/config.c
> @@ -1408,6 +1408,11 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
>                 return 0;
>         }
>
> +       if (!strcmp(var, "core.bigfilestreamingthreshold")) {
> +               big_file_streaming_threshold = git_config_ulong(var, value);
> +               return 0;
> +       }
> +
>         if (!strcmp(var, "core.packedgitlimit")) {
>                 packed_git_limit = git_config_ulong(var, value);
>                 return 0;
> diff --git a/environment.c b/environment.c
> index 9da7f3c1a19..4fcc3de7417 100644
> --- a/environment.c
> +++ b/environment.c
> @@ -46,6 +46,7 @@ size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
>  size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
>  size_t delta_base_cache_limit = 96 * 1024 * 1024;
>  unsigned long big_file_threshold = 512 * 1024 * 1024;
> +unsigned long big_file_streaming_threshold = 128 * 1024 * 1024;
>  int pager_use_color = 1;
>  const char *editor_program;
>  const char *askpass_program;

I'm not sure we need an additional "core.bigFileStreamingThreshold"
here, because "core.bigFileThreshold" is already widely used in
"index-pack", "read_object" and so on.

In a test case that uses "core.bigFileStreamingThreshold" instead of
"core.bigFileThreshold", I found that the test failed in "fsck", which
tried to allocate 15MB of memory. During "fsck_loose()",
"read_loose_object()" is called, which contains the following:

  if (*oi->typep == OBJ_BLOB && *size > big_file_threshold) {
    if (check_stream_oid(&stream, hdr, *size, path, expected_oid) < 0)
      goto out;
  } else {
    /* this will allocate 15MB of memory */
    *contents = unpack_loose_rest(&stream, hdr, *size, expected_oid);
    ...
  }
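
By contrast, the streaming path ("check_stream_oid()") keeps memory
bounded by inflating through a small fixed buffer and consuming one
chunk at a time. The shape of that loop is roughly the following (an
illustrative plain-zlib sketch, not Git's actual "git_inflate()"-based
code; the "consume" comment stands in for hashing or writing each
chunk):

  #include <zlib.h>

  #define CHUNK 8192

  static int inflate_in_chunks(const unsigned char *in, unsigned long in_len)
  {
    unsigned char out[CHUNK];
    z_stream zs = { 0 };
    int status;

    if (inflateInit(&zs) != Z_OK)
      return -1;
    zs.next_in = (unsigned char *)in;
    zs.avail_in = in_len;
    do {
      zs.next_out = out;
      zs.avail_out = sizeof(out);
      status = inflate(&zs, Z_NO_FLUSH);
      /* consume (out, sizeof(out) - zs.avail_out) here:
       * hash it, or stream it to a temporary file */
    } while (status == Z_OK);
    inflateEnd(&zs);
    return status == Z_STREAM_END ? 0 : -1;
  }

No matter how large the object is, the resident buffer stays at CHUNK
bytes.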

The same threshold-gated allocation can be found in
"unpack_entry_data()":

  static char fixed_buf[8192];
  ...
  if (type == OBJ_BLOB && size > big_file_threshold)
    buf = fixed_buf;
  else
    buf = xmallocz(size);
  ...

Although I know that setting "core.bigfilethreshold" smaller than its
default on the server side does not help me prevent users from creating
large delta objects on the client side, it can still effectively reduce
memory allocation in "receive-pack" (e.g. by setting
"core.bigfilethreshold=16k", as in the benchmarks above).

If this is not the correct way to use "core.bigfilethreshold", maybe
you can share a better solution with me.

Thanks.
-Han Xin



