Re: [PATCH] Add --show-size to git log to print message size

Junio C Hamano <gitster@xxxxxxxxx> · Sat, 14 Jul 2007 12:03:52 -0700

"Marco Costalba" <mcostalba@xxxxxxxxx> writes:

> Print message size just before the corresponding message
> to speedup the parsing by scripts/porcelains tools.

> diff --git a/log-tree.c b/log-tree.c
> index 8624d5a..2fb7761 100644
> --- a/log-tree.c
> +++ b/log-tree.c
> @@ -295,6 +295,9 @@ void show_log(struct rev_info *opt,
> 	if (opt->add_signoff)
>  		len = append_signoff(&msgbuf, &msgbuf_len, len,
> 				     opt->add_signoff);
> + 	if (opt->show_size)
> +		printf("size %i\n", len);
> +
>  	printf("%s%s%s", msgbuf, extra, sep);
>  	free(msgbuf);
> }

"size" is a bit vague here.  What if we later want to extend
things so that you can ask for the entire log entry size
including the patch output part (I am not saying that would be
an easy change --- I am more worried about the stability of the
external interface).  So is --show-"size".  "message-size" would
have been a bit easier to swallow, but I sense the problem runs
deeper.

The current code spits out a log message after formatting it in
its entirety in core, so we happen to have its size upfront.
Having to say the size upfront means we close the door for
alternative implementations that stream the log formatting
processing.

This is not a problem for log messages per se, as we
traditionally did not even show a commit log over 16kB (these
days we are supposed to be unbounded, although I do not know if
anybody actually tested that).  But if we ever want to extend
this concept to cover the patch part, so that the reader can
split the "git log" output stream into individual commits with
the same "efficiency improvements" you are seeing from this
patch, that becomes a real problem, I would think.

Naturally, this reminds me of having to say Content-Length
upfront vs chunked transfer.  Essentially you are treating the
output stream from "git log" into the pipe as a sequence of
messages (and without "-p", your "size" is exactly what a
"Content-Length" header is).  The fact that this analogy works
only when the command is run without "-p" (but "--show-size"
does not check that) bothers me.  What would we do when we want
to help readers that read "-p" output?
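
To make the analogy concrete, here is a rough sketch of the two
framings.  Everything in it is hypothetical: the function names and
the "chunk <n>" header are made up for illustration and are not
anything git emits or that this patch proposes.  The point is only
that the first writer needs the whole payload in core before it can
print anything, while the second can emit pieces as they are
produced and mark the end with a zero-length chunk.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical framing helpers, for illustration only.
 * Neither the names nor the "chunk" header exist in git.
 */

/* Length-prefixed framing: the caller must already hold the full payload. */
int emit_sized(FILE *out, const char *payload, size_t len)
{
	if (fprintf(out, "size %lu\n", (unsigned long)len) < 0)
		return -1;
	return fwrite(payload, 1, len, out) == len ? 0 : -1;
}

/* Chunked framing: each piece can be written as soon as it is produced. */
int emit_chunk(FILE *out, const char *piece, size_t len)
{
	if (fprintf(out, "chunk %lu\n", (unsigned long)len) < 0)
		return -1;
	return fwrite(piece, 1, len, out) == len ? 0 : -1;
}

/* A zero-length chunk tells the reader that the record is complete. */
int emit_end(FILE *out)
{
	return emit_chunk(out, "", 0);
}
```

With something like emit_chunk(), the "-p" output could be framed
without formatting the whole patch in core first, at the cost of the
reader having to loop until it sees the terminating chunk.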

I have a more basic question. If you are reading from non "-p"
output, where do you exactly have the wasted cycles in your
reader's processing?  One immediately obvious thing is that you
would not have to repeatedly realloc your buffer to keep one
message worth of data in core, but somehow I cannot imagine that
that is the source of a huge performance boost.

One use case I can think of where this would give a huge
improvement is if you read the stream and show only every tenth
commit.  You can discard the other 9 out of 10 without even
looking at their contents, and being able to read a known number
of bytes and immediately discard them would certainly be much more
efficient than having to scan for NUL, only to discard.  But that
does not sound like a plausible real-life scenario.
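
For reference, the two ways a reader could skip a record look
roughly like the sketch below.  It assumes a simplified record
layout that I made up for illustration ("size <n>\n", then <n>
bytes of message, then a single NUL), which is not exactly what
"git log -z" with this patch would produce, and the function names
are hypothetical.  It works on an in-core buffer to keep the
example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Assumed (hypothetical) record layout, simplified for illustration:
 *   "size <n>\n", then <n> bytes of message, then one NUL byte.
 */

/* Skip one record by trusting its size header; NULL if malformed. */
const char *skip_sized_record(const char *p, const char *end)
{
	const char *digits;
	size_t n = 0;

	assert(p && end && p <= end);
	if ((size_t)(end - p) < 5 || memcmp(p, "size ", 5) != 0)
		return NULL;
	p += 5;
	digits = p;
	while (p < end && *p >= '0' && *p <= '9') {
		if (n > (SIZE_MAX - 9) / 10)
			return NULL;	/* size would overflow */
		n = n * 10 + (size_t)(*p - '0');
		p++;
	}
	if (p == digits || p == end || *p != '\n')
		return NULL;
	p++;
	/* Need <n> message bytes plus the trailing NUL inside the buffer. */
	if (n >= (size_t)(end - p) || p[n] != '\0')
		return NULL;
	return p + n + 1;
}

/* Skip one record by scanning for its terminating NUL; NULL if none. */
const char *skip_scanned_record(const char *p, const char *end)
{
	const char *nul;

	assert(p && end && p <= end);
	nul = memchr(p, '\0', (size_t)(end - p));
	return nul ? nul + 1 : NULL;
}

typedef const char *(*skip_fn)(const char *, const char *);

/* Count the well-formed records at the start of buf using one skipper. */
size_t count_records(const char *buf, size_t len, skip_fn skip)
{
	const char *p = buf, *end = buf + len;
	size_t count = 0;

	while (p && p < end) {
		p = skip(p, end);
		if (p)
			count++;
	}
	return count;
}
```

The sized variant touches only the header bytes of each record it
discards, while the scanning variant has to look at every byte.
That is where the saving would come from, if there is one.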
