Re: [PATCH v4 2/2] sha1_file: refactor sha1_file.c to support 'cat-file --literally'

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 17 Mar 2015 13:10:31 -0700

Karthik Nayak <karthik.188@xxxxxxxxx> writes:

> Subject: Re: [PATCH v4 2/2] sha1_file: refactor sha1_file.c to support 'cat-file --literally'

> Modify sha1_loose_object_info() to support 'cat-file --literally'
> by accepting flags and also make changes to copy the type to
> object_info::typename.

It would be more descriptive to mention the central point on the
title regarding what it takes to "support cat-file --literally".

For example:

  sha1_file.c: support reading from a loose object of a bogus type

  Update sha1_loose_object_info() to optionally allow it to read
  from a loose object file of unknown/bogus type; as the function
  usually returns the type of the object it read in the form of enum
  for known types, add an optional "typename" field to receive the
  name of the type in textual form.

By the way, I think your 1/2 would not even compile unless you have
this change; the patches in these two patch series must be swapped,
no?

> diff --git a/sha1_file.c b/sha1_file.c
> index 69a60ec..e31e9e2 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1564,6 +1564,36 @@ int unpack_sha1_header(git_zstream *stream, unsigned char *map, unsigned long ma
>  	return git_inflate(stream, 0);
>  }
>  
> +static int unpack_sha1_header_literally(git_zstream *stream, unsigned char *map,
> +					unsigned long mapsize,
> +					struct strbuf *header)
> +{
> +	unsigned char buffer[32], *cp;
> +	unsigned long bufsiz = sizeof(buffer);
> +	int status;
> +
> +	/* Get the data stream */
> +	memset(stream, 0, sizeof(*stream));
> +	stream->next_in = map;
> +	stream->avail_in = mapsize;
> +	stream->next_out = buffer;
> +	stream->avail_out = bufsiz;
> +
> +	git_inflate_init(stream);
> +
> +	do {
> +		status = git_inflate(stream, 0);
> +		strbuf_add(header, buffer, stream->next_out - buffer);
> +		for (cp = buffer; cp < stream->next_out; cp++)
> +			if (!*cp)
> +				/* Found the NUL at the end of the header */
> +				return 0;
> +		stream->next_out = buffer;
> +		stream->avail_out = bufsiz;
> +	} while (status == Z_OK);
> +	return -1;
> +}
> +

There is nothing "literal" about this function.

The only difference between the original and this one is that this
uses a strbuf, instead of a buffer of a fixed size allocated by the
caller, to return the early part of the inflated data.

I wonder if it incurs too much overhead to the existing callers if
you reimplementated unpack_sha1_header() as a thin wrapper around
this function, something like

	int unpack_sha1_header(git_zstream *stream,
				unsigned char *map, unsigned long mapsize,
                                void *buffer, unsigned long bufsiz)
	{
                int status;
		struct strbuf header = STRBUF_INIT;

                status = unpack_sha1_header_to_strbuf(stream, map, mapsize, &header);
		if (bufsiz < header.len)
                	die("object header too long");
		memcpy(buffer, header.buf, header.len);
                return status;
	}

Note that I am *not* suggesting to do this blindly.  If there is no
downside from performance point of view, however, the above would be
a way to avoid code duplication.

Another way to avoid code duplication may be to implement
unpack_sha1_header_to_strbuf() in terms of unpack_sha1_header(),
perhaps like this:

	static int unpack_sha1_header_to_strbuf(...)
        {
		unsigned char buffer[32], *cp;
                unsigned long bufsiz = sizeof(buffer);
        	int status = unpack_sha1_header(stream, map, mapsize, buffer, bufsiz);

		if (status)
                	return status;
		do {
			strbuf_add(header, buffer, stream->next_out - buffer);
			for (cp = buffer; cp < stream->next_out; cp++)
				if (!*cp)
					/* Found the NUL at the end of the header */
					return 0;
			stream->next_out = buffer;
			stream->avail_out = bufsiz;
		} while (status == Z_OK);
		return -1;
	}

which may be more in line with how we read data from loose objects.

> +int parse_sha1_header_extended(const char *hdr, struct object_info *oi,
> +			       int flags)

Unless we are taking advantage of the fact that MSB is special in
signed integral types, I would prefer to see us use an unsigned type
to store these bits in a variable of an integral type.  That would
let the readers assume that they have fewer tricky things possibly
going on in the code (also see the footnote to $gmane/263751).

> @@ -1652,12 +1674,45 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
>  			size = size * 10 + c;
>  		}
>  	}
> -	*sizep = size;
> +
> +	type = type_from_string_gently(typename.buf, -1, 1);

Doesn't the code know how long the typename is at this point?

> +	if (oi->sizep)
> +		*oi->sizep = size;
> +	if (oi->typename)
> +		strbuf_addstr(oi->typename, typename.buf);

Likewise.  Perhaps strbuf_addbuf()?

> +	if (oi->typep)
> +		*oi->typep = type;

Do you want to store -1 to *oi->typep when the caller asked to do
the "literally" thing?  Shouldn't it match what is returned from
this function?

> +	return *hdr ? -1 : type;
> +}
> +
> +/*
> + * We used to just use "sscanf()", but that's actually way
> + * too permissive for what we want to check. So do an anal
> + * object header parse by hand. Calls the extended function.
> + */

The comment "let's do better than sscanf() by parsing ourselves"
applies to the implementation of _extended() function, not this one,
no?  It is clear to everybody that it "Calls the extended function",
so why mention it?

> +int parse_sha1_header(const char *hdr, unsigned long *sizep)
> +{
> +	struct object_info oi;
> +
> +	oi.sizep = sizep;
> +	oi.typename = NULL;
> +	oi.typep = NULL;
> +	return parse_sha1_header_extended(hdr, &oi, LOOKUP_REPLACE_OBJECT);
>  }

> @@ -2524,13 +2579,15 @@ struct packed_git *find_sha1_pack(const unsigned char *sha1,
>  }
>  
>  static int sha1_loose_object_info(const unsigned char *sha1,
> -				  struct object_info *oi)
> +				  struct object_info *oi,
> +				  int flags)
>  {
> -	int status;
> +	int status = 0;
>  	unsigned long mapsize, size;
>  	void *map;
>  	git_zstream stream;
>  	char hdr[32];
> +	struct strbuf hdrbuf = STRBUF_INIT;
>  
>  	if (oi->delta_base_sha1)
>  		hashclr(oi->delta_base_sha1);
> @@ -2557,17 +2614,29 @@ static int sha1_loose_object_info(const unsigned char *sha1,
>  		return -1;
>  	if (oi->disk_sizep)
>  		*oi->disk_sizep = mapsize;
> -	if (unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr)) < 0)
> -		status = error("unable to unpack %s header",
> -			       sha1_to_hex(sha1));
> -	else if ((status = parse_sha1_header(hdr, &size)) < 0)
> -		status = error("unable to parse %s header", sha1_to_hex(sha1));
> -	else if (oi->sizep)
> +	if ((flags & LOOKUP_LITERALLY)) {
> +		if (unpack_sha1_header_literally(&stream, map, mapsize, &hdrbuf) < 0)
> +			status = error("unable to unpack %s header with --literally",
> +				       sha1_to_hex(sha1));
> +		else if ((status = parse_sha1_header_extended(hdrbuf.buf, oi, flags)) < 0)
> +			status = error("unable to parse %s header", sha1_to_hex(sha1));
> +	} else {
> +		if (unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr)) < 0)
> +			status = error("unable to unpack %s header",
> +				       sha1_to_hex(sha1));
> +		else if ((status = parse_sha1_header(hdr, &size)) < 0)
> +			status = error("unable to parse %s header", sha1_to_hex(sha1));
> +	}
> +	if (oi->sizep)
>  		*oi->sizep = size;

Have you considered calling parse_sha1_header_extended() for both
literally and normal cases?  Then you wouldn't have to do any of the
"if (oi->blah) *oi->blah = value", no?

>  	git_inflate_end(&stream);
>  	munmap(map, mapsize);
>  	if (oi->typep)
>  		*oi->typep = status;

Likewise.

> +	if (oi->typename && !(oi->typename->len))
> +		strbuf_addstr(oi->typename, typename(status));

Likewise.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html