Re: [PATCH 05/10] git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT

Taylor Blau <me@xxxxxxxxxxxx> · Thu, 27 Feb 2025 19:31:01 -0500

On Wed, Feb 26, 2025 at 05:26:04AM -0800, Junio C Hamano wrote:
> Jeff King <peff@xxxxxxxx> writes:
>
> > But these do not correspond when we see Z_NEED_DICT! Zlib consumes the
> > bytes from the input buffer but it does not increment total_in. And so
> > we hit the BUG("total_in mismatch") call.
> >
> > There are a few options here:
> >
> >   - We could ditch that BUG() check. It is making too many assumptions
> >     about how zlib updates these values. But it does have value in most
> >     cases as a sanity check on the values we're copying.
> >
> >   - We could skip the zlib_post_call() entirely when we see Z_NEED_DICT.
> >     We know that it's hard error for us, so we should just send the
> >     status up the stack and let the caller bail.
> >
> >     The downside is that if we ever did want to support dictionaries,
> >     we couldn't (the git_zstream will be out of sync, since we never
> >     copied its values back from the z_stream).
> >
> >   - We could continue to call zlib_post_call(), but skip just that BUG()
> >     check if the status is Z_NEED_DICT. This keeps git_inflate() as a
> >     thin wrapper around inflate(), and would let us later support
> >     dictionaries for some calls if we wanted to.
> >
> > This patch uses the third approach. It seems like the least-surprising
> > thing to keep git_inflate() a close to inflate() as possible. And while
> > it makes the diff a bit larger (since we have to pass the status down to
> > to the zlib_post_call() function), it's a static local function, and
> > every caller by definition will have just made a zlib call (and so will
> > have a status integer).
>
> Ouch.  I am not sure if I would have made the choice you guys made
> if I were writing this fix, but that is mostly because I do not
> anticipate that we would ever support NEED_DICT and the other two
> options seem simpler, and certainly not because I have any concrete
> reason to oppose the third approach.

It's been a few weeks since I last looked at this patch, but I have a
vague recollection that we chose the second approach while discussing
this together.

And indeed, looking at the copy I have from that session, it looks like
we did

--- 8< ---

diff --git a/git-zlib.c b/git-zlib.c
index 651dd9e07c..265d3074e1 100644
--- a/git-zlib.c
+++ b/git-zlib.c
@@ -122,6 +122,8 @@ int git_inflate(git_zstream *strm, int flush)
                                 ? 0 : flush);
                if (status == Z_MEM_ERROR)
                        die("inflate: out of memory");
+               if (status == Z_NEED_DICT)
+                       break;
                zlib_post_call(strm);

                /*
--- >8 ---

I actually have a vague preference towards that approach, since I too do
not anticipate that we'd ever need/want to support Z_NEED_DICT (and if
we did, we could always change from option (2) to (3) later on).
Likewise, the resulting diff is considerably smaller by line count,
though on the other hand this diff is not all that more complicated
overall.

I don't have a strong enough preference to suggest you make any changes
here, but I thought I'd mention it nonetheless.

> > +test_expect_success 'object reading handles zlib dictionary' - <<\EOT
> > +	echo 'content that will be recompressed' >file &&
> > +	blob=$(git hash-object -w file) &&
> > +	objpath=.git/objects/$(test_oid_to_path "$blob") &&
> > +
> > +	# Recompress a loose object using a precomputed zlib dictionary.
> > +	# This was originally done with:
> > +	#
> > +	#  perl -MCompress::Raw::Zlib -e '
> > +	#    binmode STDIN;
> > +	#    binmode STDOUT;
> > +	#    my $data = do { local $/; <STDIN> };
> > +	#    my $in = new Compress::Raw::Zlib::Inflate;
> > +	#    my $de = new Compress::Raw::Zlib::Deflate(
> > +	#      -Dictionary => "anything"
> > +	#    );
> > +	#    $in->inflate($data, $raw);
> > +	#    $de->deflate($raw, $out);
> > +	#    print $out;
> > +	#  ' <obj.bak >$objpath
> > +	#
> > +	# but we do not want to require the perl module for all test runs (nor
> > +	# carry a custom t/helper program that uses zlib features we don't
> > +	# otherwise care about).
> > +	mv "$objpath" obj.bak &&
> > +	test_when_finished 'mv obj.bak "$objpath"' &&
> > +	printf '\170\273\017\112\003\143' >$objpath &&
> > +
> > +	test_must_fail git cat-file blob $blob 2>err &&
> > +	test_grep 'error: inflate: needs dictionary' err
> > +EOT
> > +
>
> Wheee.  I guess an ugly and down-to-bit-sequence test is much better
> than having no test at all ;-)

As a fun aside, Peff wrote this Perl script (unsurprisingly, I strongly
dislike writing Perl), and amazingly it worked the first try when we
were still hypothesizing about what this bug would look like. It was
quite the surprise to both of us, I think, that it worked so well.

Thanks,
Taylor