Re: [PATCH 2/3] parse_commit(): parse timestamp from end of line

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 24 Apr 2023 10:05:42 -0700

Jeff King <peff@xxxxxxxx> writes:

> To find the committer timestamp, we parse left-to-right looking for the
> closing ">" of the email, and then expect the timestamp right after
> that. But we've seen some broken cases in the wild where this fails, but
> we _could_ find the timestamp with a little extra work. E.g.:
>
>   Name <Name<email>> 123456789 -0500
> ...
> So let's use the same trick as 03818a4a94: find the end of the line, and
> parse back to the final ">".

This obviously assumes that even in a broken ident, it is very
likely that the second component (where <e-mail> usually sits) ends
with a '>'.  Given that we enclose whatever garbage the end user
gave us as their e-mail inside a pair of <> ourselves, it is a very
sensible assumption, I would think.  The original parser assumed
that the end user would not have '>' inside their e-mail part of the
ident, which turns out to be more problematic than the alternative
being proposed.  It is doubly good that we already parse from the
end elsewhere.

Nice.

> +	/*
> +	 * parse to end-of-line and then walk backwards, which
> +	 * handles some malformed cases.
> +	 */

I would say "parse to" -> "jump to", but technically moving forward
looking for a LF byte is still "parsing".  "some" malformed cases
being "most plausible" ones (due to how ident.c::fmt_ident() is what
writes '>' after the string end-user gave as e-mail) may be worth
mentioning.

> +	eol = memchr(buf, '\n', tail - buf);
> +	if (!eol)
>  		return 0;

OK.

> -	dateptr = buf;
> -	while (buf < tail && *buf++ != '\n')
> +	for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--)
>  		/* nada */;

OK.  Just a style thing, but I found that "; /* nada */" is easier
to spot that there is an empty statement there.

> -	if (buf >= tail)
> +	if (dateptr == buf || dateptr == eol)
>  		return 0;

Curious when dateptr that wanted to scan back from eol is still at
eol after the loop.  It is when the ident line ends with ">" without
any timestamp/tz info.   And the reason why we need to check that
here is ...

> -	/* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */
> +
> +	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
>  	return parse_timestamp(dateptr, NULL, 10);

... because parse_timestamp() is merely strtoumax() and would
happily skip over arbitrary number of leading "whitespace" without
stopping if (dateptr == eol && *eol == '\n').  OK, sad but correct.