Re: [PATCH 1/7] oid_array: use size_t for count and allocation

Taylor Blau <me@xxxxxxxxxxxx> · Tue, 14 Apr 2020 18:27:58 -0600

On Mon, Mar 30, 2020 at 10:03:09AM -0400, Jeff King wrote:
> The oid_array object uses an "int" to store the number of items and the
> allocated size. It's rather unlikely for somebody to have more than 2^31
> objects in a repository (the sha1's alone would be 40GB!), but if they
> do, we'd overflow our alloc variable.
>
> You can reproduce this case with something like:
>
>   git init repo
>   cd repo
>
>   # make a pack with 2^24 objects
>   perl -e '
>     my $nr = 2**24;
>
>     for (my $i = 0; $i < $nr; $i++) {
> 	    print "blob\n";
> 	    print "data 4\n";
> 	    print pack("N", $i);
>     }
>   ' | git fast-import
>
>   # now make 256 copies of it; most of these objects will be duplicates,
>   # but oid_array doesn't de-dup until all values are read and it can
>   # sort the result.
>   cd .git/objects/pack/
>   pack=$(echo *.pack)
>   idx=$(echo *.idx)
>   for i in $(seq 0 255); do
>     # no need to waste disk space
>     ln "$pack" "pack-extra-$i.pack"
>     ln "$idx" "pack-extra-$i.idx"
>   done
>
>   # and now force an oid_array to store all of it
>   git cat-file --batch-all-objects --batch-check
>
> which results in:
>
>   fatal: size_t overflow: 32 * 18446744071562067968
>
> So the good news is that st_mult() sees the problem (the large number is
> because our int wraps negative, and then that gets cast to a size_t),
> doing the job it was meant to: bailing in crazy situations rather than
> causing an undersized buffer.
>
> But we should avoid hitting this case at all, and instead limit
> ourselves based on what malloc() is willing to give us. We can easily do
> that by switching to size_t.
>
> The cat-file process above made it to ~120GB virtual set size before the
> integer overflow (our internal hash storage is 32-bytes now in
> preparation for sha256, so we'd expect ~128GB total needed, plus
> potentially more to copy from one realloc'd block to another)). After
> this patch (and about 130GB of RAM+swap), it does eventually read in the
> whole set. No test for obvious reasons.

;). This patch looks good, and makes immediate sense given your
explanation.

> Signed-off-by: Jeff King <peff@xxxxxxxx>
> ---
>  sha1-array.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/sha1-array.h b/sha1-array.h
> index dc1bca9c9a..c5e4b9324f 100644
> --- a/sha1-array.h
> +++ b/sha1-array.h
> @@ -49,8 +49,8 @@
>   */
>  struct oid_array {
>  	struct object_id *oid;
> -	int nr;
> -	int alloc;
> +	size_t nr;
> +	size_t alloc;
>  	int sorted;
>  };
>
> --
> 2.26.0.597.g7e08ed78ff

Thanks,
Taylor