Re: [PATCH v13 1/7] unpack-objects: low memory footprint for get_data() in dry_run mode

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 06 Jun 2022 11:35:45 -0700

Ævar Arnfjörð Bjarmason  <avarab@xxxxxxxxx> writes:

> From: Han Xin <hanxin.hx@xxxxxxxxxxxxxxx>
>
> As the name implies, "get_data(size)" will allocate and return a given
> amount of memory. Allocating memory for a large blob object may cause the
> system to run out of memory. Before preparing to replace calling of
> "get_data()" to unpack large blob objects in latter commits, refactor
> "get_data()" to reduce memory footprint for dry_run mode.
>
> Because in dry_run mode, "get_data()" is only used to check the
> integrity of data, and the returned buffer is not used at all, we can
> allocate a smaller buffer and reuse it as zstream output. Therefore,

"reuse" -> "use"

> in dry_run mode, "get_data()" will release the allocated buffer and
> return NULL instead of returning garbage data.

It makes it sound as if we used to return garbage data, but I do not
think that is what happened in reality.  Perhaps rewrite the last
sentence like

	Make the function return NULL in the dry-run mode, as no
	callers use the returned buffer.

or something?

The overall logic sounds quite sensible.

> The "find [...]objects/?? -type f | wc -l" test idiom being used here
> is adapted from the same "find" use added to another test in
> d9545c7f465 (fast-import: implement unpack limit, 2016-04-25).

> +/*
> + * Decompress zstream from stdin and return specific size of data.

"specific size"?  The caller specifies the size of data (because it
knows a-priori how many bytes the zstream should inflate to), so

    Decompress zstream from the standard input into a newly
    allocated buffer of specified size and return the buffer.

or something, perhaps.  In any case, it needs to say that the caller
is responsible for giving the "right" size.

> + * The caller is responsible to free the returned buffer.
> + *
> + * But for dry_run mode, "get_data()" is only used to check the
> + * integrity of data, and the returned buffer is not used at all.
> + * Therefore, in dry_run mode, "get_data()" will release the small
> + * allocated buffer which is reused to hold temporary zstream output
> + * and return NULL instead of returning garbage data.
> + */
>  static void *get_data(unsigned long size)
>  {
>  	git_zstream stream;
> -	void *buf = xmallocz(size);
> +	unsigned long bufsize = dry_run && size > 8192 ? 8192 : size;
> +	void *buf = xmallocz(bufsize);

OK.

>  	memset(&stream, 0, sizeof(stream));
>  
>  	stream.next_out = buf;
> -	stream.avail_out = size;
> +	stream.avail_out = bufsize;
>  	stream.next_in = fill(1);
>  	stream.avail_in = len;
>  	git_inflate_init(&stream);
> @@ -125,8 +136,15 @@ static void *get_data(unsigned long size)

What's hidden in the pre-context is this bit:

		int ret = git_inflate(&stream, 0);
		use(len - stream.avail_in);
		if (stream.total_out == size && ret == Z_STREAM_END)
			break;
		if (ret != Z_OK) {
			error("inflate returned %d", ret);
			FREE_AND_NULL(buf);
			if (!recover)
				exit(1);
			has_errors = 1;
			break;
		}

and it is correct to use "size", not "bufsize", for this check.
Unless we receive exactly the caller-specified "size" bytes from the
inflated zstream with Z_STREAM_END, we want to detect an error and
bail out.

I am not sure if this is not loosening the error checking in the
dry-run case, though.  In the original code, we set the avail_out
to the total expected size so

 (1) if the caller gives too small a size, git_inflate() would stop
     at stream.total_out with ret that is not STREAM_END nor OK,
     bypassing the "break", and we catch the error.

 (2) if the caller gives too large a size, git_inflate() would stop
     at the true size of inflated zstream, with STREAM_END and would
     not hit this "break", and we catch the error.

With the new code, since we keep refreshing avail_out (see below),
git_inflate() does not even learn how many bytes we are _expecting_
to see.  Is the error checking in the loop, with the updated code,
catch the mismatch between expected and actual size (plausibly
caused by a corrupted zstream) the same way as we do in the 
non dry-run code path?

>  		}
>  		stream.next_in = fill(1);
>  		stream.avail_in = len;
> +		if (dry_run) {
> +			/* reuse the buffer in dry_run mode */
> +			stream.next_out = buf;
> +			stream.avail_out = bufsize;
> +		}
>  	}
>  	git_inflate_end(&stream);
> +	if (dry_run)
> +		FREE_AND_NULL(buf);
>  	return buf;
>  }