Re: [PATCH v3 11/17] object-file.c: stop dying in parse_loose_header()

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Thu, 27 May 2021 10:50:39 -0700

> Start the libification of parse_loose_header() by making it return
> error codes and data instead of invoking die() by itself. For now
> we'll move the relevant die() call to loose_object_info() and
> read_loose_object() to keep this change smaller, but in subsequent
> commits we'll also libify those.
> 
> The reason this makes sense is that with the refactoring of
> parse_loose_header_extended() in an earlier commit the public
> interface for parse_loose_header() no longer just accepts a "unsigned
> long *sizep". Rather it accepts a "struct object_info *", that
> structure will be populated with information about the object.
> 
> It thus makes sense to further libify the interface so that it stops
> calling die() when it encounters OBJ_BAD, and instead rely on its
> callers to check the populated "oi->typep".
> 
> This also allows us to simplify away the
> unpack_loose_header_to_strbuf() function added in
> 46f034483eb (sha1_file: support reading from a loose object of unknown
> type, 2015-05-03). Its code was mostly copy/pasted between it and both
> of unpack_loose_header() and unpack_loose_short_header(). We now have
> a single unpack_loose_header() function which accepts an optional
> "struct strbuf *" instead.
> 
> I think the remaining unpack_loose_header() function could be further
> simplified, we're carrying some complexity just to be able to emit a
> garbage type longer than MAX_HEADER_LEN, we could alternatively just
> say "we found a garbage type <first 32 bytes>..." instead, but let's
> leave this in place for now.

Looking at the patch itself, this patch:
 1. Combines unpack_loose_header(), unpack_loose_short_header(), and
    unpack_loose_header_to_strbuf() into one function. It does different
    things depending on whether the struct strbuf * is provided.
 2. Updates parse_loose_header() to:
    a. never die upon invalid object type
    b. not accept a flags argument (which was used solely to control
       whether it died or not upon invalid object type)
 3. Updates the callers of these functions.

I think 2b should have been mentioned in the commit message. (And
overall I think that the commit message could be shorter, but what
information to include and exclude is a subjective matter, so I won't
concern myself so much about that.)

Also, I think that 1 should be split into its own commit. As it is, I'm
not even sure if it's a good idea - for me, it is confusing for a
function to consume or not consume more of the stream depending on
whether an argument is NULL or not.

>  	/*
>  	 * Check if entire header is unpacked in the first iteration.
>  	 */
>  	if (memchr(buffer, '\0', stream->next_out - (unsigned char *)buffer))
>  		return 0;
>  
> +	/*
> +	 * We have a header longer than MAX_HEADER_LEN. We abort early
> +	 * unless under we're running as e.g. "cat-file
> +	 * --allow-unknown-type".
> +	 */
> +	if (!header)
> +		return -1;

What do you mean by "unless under we're running as"? And how would we
know at this point that we're running as "cat-file --allow-unknown-type"
just by checking "header"?

> -	if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE)) {
> -		if (unpack_loose_header_to_strbuf(&stream, map, mapsize, hdr, sizeof(hdr), &hdrbuf) < 0)
> -			status = error(_("unable to unpack %s header with --allow-unknown-type"),
> -				       oid_to_hex(oid));
> -	} else if (unpack_loose_header(&stream, map, mapsize, hdr, sizeof(hdr)) < 0) {
> +
> +	hdr_ret = unpack_loose_header(&stream, map, mapsize, hdr, sizeof(hdr),
> +				      allow_unknown ? &hdrbuf : NULL);
> +	if (hdr_ret < 0) {
>  		status = error(_("unable to unpack %s header"),
>  			       oid_to_hex(oid));
>  	}

This hunk would go into the split commit (for the unpack_loose_header()
refactoring) I suggested above.

> -
> -	if (status < 0) {
> -		/* Do nothing */
> -	} else if (hdrbuf.len) {
> -		if ((status = parse_loose_header(hdrbuf.buf, oi, flags)) < 0)
> -			status = error(_("unable to parse %s header with --allow-unknown-type"),
> -				       oid_to_hex(oid));
> -	} else if ((status = parse_loose_header(hdr, oi, flags)) < 0) {
> +	if (!status && parse_loose_header(hdrbuf.len ? hdrbuf.buf : hdr, oi) < 0) {
>  		status = error(_("unable to parse %s header"), oid_to_hex(oid));
>  	}
> +	if (!allow_unknown && *oi->typep < 0)
> +		die(_("invalid object type"));

I think this change doesn't need to be so big? The oi->typep check could
just go in the "else if" branch that happens if --allow-unknown-type is
not set.

> diff --git a/object-store.h b/object-store.h
> index d443964447..740edcac30 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -477,11 +477,30 @@ int for_each_object_in_pack(struct packed_git *p,
>  int for_each_packed_object(each_packed_object_fn, void *,
>  			   enum for_each_object_flags flags);
>  
> +/**
> + * unpack_loose_header() initializes the data stream needed to unpack
> + * a loose object header.
> + *
> + * Returns 0 on success. Returns negative values on error.
> + *
> + * It will only parse up to MAX_HEADER_LEN bytes unless an optional
> + * "hdrbuf" argument is non-NULL. This is intended for use with
> + * OBJECT_INFO_ALLOW_UNKNOWN_TYPE to extract the bad type for (error)
> + * reporting. The full header will be extracted to "hdrbuf" for use
> + * with parse_loose_header().
> + */
>  int unpack_loose_header(git_zstream *stream, unsigned char *map,
>  			unsigned long mapsize, void *buffer,
> +			unsigned long bufsiz, struct strbuf *hdrbuf);

Parsing up to MAX_HEADER_LEN only occasionally is confusing. Could we
always make it parse up to MAX_HEADER_LEN, and then put a truncated
header in hdrbuf if it is too big (and therefore invalid)?

> +/**
> + * parse_loose_header() parses the starting "<type> <len>\0" of an
> + * object. If it doesn't follow that format -1 is returned. To check
> + * the validity of the <type> populate the "typep" in the "struct
> + * object_info". It will be OBJ_BAD if the object type is unknown.
> + */
> +int parse_loose_header(const char *hdr, struct object_info *oi);

Also mention what happens if the size is invalid.