Re: [PATCH v2 0/3] fixing some parse_commit() timestamp corner cases

Jeff King <peff@xxxxxxxx> · Tue, 25 Apr 2023 01:54:42 -0400

On Tue, Apr 25, 2023 at 01:52:45AM -0400, Jeff King wrote:

> Here's a v2 of my series. The behavior should be identical, but I've
> incorporated some comment and small code tweaks based on feedback from
> the first round.
> 
> I also added a fourth patch which adds a new comment explaining some of
> the cases that were alluded to in the earlier round's patch 3.
> 
>   [1/4]: t4212: avoid putting git on left-hand side of pipe
>   [2/4]: parse_commit(): parse timestamp from end of line
>   [3/4]: parse_commit(): handle broken whitespace-only timestamp
>   [4/4]: parse_commit(): describe more date-parsing failure modes
> 
>  commit.c               | 47 +++++++++++++++++++++++++++++++++++-------
>  t/t4212-log-corrupt.sh | 39 +++++++++++++++++++++++++++++++++--
>  2 files changed, 76 insertions(+), 10 deletions(-)

Whoops, forgot my range-diff (though nothing should be too surprising
based on the round 1 discussion):

1:  07932cf666 = 1:  ac38ce133d t4212: avoid putting git on left-hand side of pipe
2:  7ee34c7d5f ! 2:  f59e61262d parse_commit(): parse timestamp from end of line
    @@ Commit message
         parse back to the final ">". In theory we could use split_ident_line()
         here, but it's actually a bit more strict. In particular, it requires a
         valid time-zone token, too. That should be present, of course, but we
    -    wouldn't want to break --until for malformed cases that are working
    -    currently.
    +    wouldn't want to break --until for cases that are working currently.

         We might want to teach split_ident_line() to become more lenient there,
         but it would require checking its many callers (since right now they can
    @@ commit.c: static timestamp_t parse_commit_date(const char *buf, const char *tail
     -	if (buf >= tail)
     +
     +	/*
    -+	 * parse to end-of-line and then walk backwards, which
    -+	 * handles some malformed cases.
    ++	 * Jump to end-of-line so that we can walk backwards to find the
    ++	 * end-of-email ">". This is more forgiving of malformed cases
    ++	 * because unexpected characters tend to be in the name and email
    ++	 * fields.
     +	 */
     +	eol = memchr(buf, '\n', tail - buf);
     +	if (!eol)
      		return 0;
     -	dateptr = buf;
     -	while (buf < tail && *buf++ != '\n')
    -+	for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--)
    - 		/* nada */;
    +-		/* nada */;
     -	if (buf >= tail)
    ++	dateptr = eol;
    ++	while (dateptr > buf && dateptr[-1] != '>')
    ++		dateptr--;
     +	if (dateptr == buf || dateptr == eol)
      		return 0;
     -	/* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */
3:  e8e94083f5 ! 3:  c62fc59bf1 parse_commit(): handle broken whitespace-only timestamp
    @@ Commit message
         It's not subject to the same bug, because it insists that there be one
         or more digits in the timestamp.

    -    We can use the same logic here. If there's a non-whitespace but
    -    non-digit value (say "committer name <email> foo"), then
    -    parse_timestamp() would already have returned 0 anyway. So the only
    -    change should be for this "whitespace only" case.
    -
         Signed-off-by: Jeff King <peff@xxxxxxxx>

      ## commit.c ##
     @@ commit.c: static timestamp_t parse_commit_date(const char *buf, const char *tail)
    - 	if (dateptr == buf || dateptr == eol)
    + 	dateptr = eol;
    + 	while (dateptr > buf && dateptr[-1] != '>')
    + 		dateptr--;
    +-	if (dateptr == buf || dateptr == eol)
    ++	if (dateptr == buf)
      		return 0;

    +-	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
     +	/*
    -+	 * trim leading whitespace; parse_timestamp() will do this itself, but
    -+	 * it will walk past the newline at eol while doing so. So we insist
    -+	 * that there is at least one digit here.
    ++	 * Trim leading whitespace; parse_timestamp() will do this itself, but
    ++	 * if we have _only_ whitespace, it will walk right past the newline
    ++	 * while doing so.
     +	 */
     +	while (dateptr < eol && isspace(*dateptr))
     +		dateptr++;
    -+	if (!strchr("0123456789", *dateptr))
    ++	if (dateptr == eol)
     +		return 0;
     +
    - 	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
    ++	/*
    ++	 * We know there is at least one non-whitespace character, so we'll
    ++	 * begin parsing there and stop at worst case at eol.
    ++	 */
      	return parse_timestamp(dateptr, NULL, 10);
      }
    + 

      ## t/t4212-log-corrupt.sh ##
     @@ t/t4212-log-corrupt.sh: test_expect_success 'absurdly far-in-future date' '
-:  ---------- > 4:  28ed51a2ca parse_commit(): describe more date-parsing failure modes
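
For anyone reading the range-diff without the full patches at hand, here is a minimal standalone sketch of the date-finding logic as it stands after patch 3. It is an illustration and not the code in commit.c: the function name sketch_parse_date() is hypothetical, it is handed only the committer line rather than a whole commit buffer, and it calls strtoumax() directly on the assumption that it behaves like parse_timestamp() for these inputs.

```c
#include <ctype.h>
#include <inttypes.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the tail end of parse_commit_date().
 * "buf" points at the start of a committer line and "tail" at the
 * end of the buffer. Returns 0 for anything we cannot parse.
 */
static uintmax_t sketch_parse_date(const char *buf, const char *tail)
{
	const char *eol, *dateptr;

	/* Find end-of-line; give up if the line is unterminated. */
	eol = memchr(buf, '\n', tail - buf);
	if (!eol)
		return 0;

	/* Walk backwards to just past the final ">" of the email. */
	dateptr = eol;
	while (dateptr > buf && dateptr[-1] != '>')
		dateptr--;
	if (dateptr == buf)
		return 0; /* no ">" on this line at all */

	/*
	 * Skip leading whitespace ourselves, stopping at eol. If we let
	 * strtoumax() do it, a whitespace-only field would let it walk
	 * past the newline and parse digits from the next line.
	 */
	while (dateptr < eol && isspace((unsigned char)*dateptr))
		dateptr++;
	if (dateptr == eol)
		return 0;

	/* At least one non-whitespace byte remains before eol. */
	return strtoumax(dateptr, NULL, 10);
}
```

With this sketch, a line such as "committer A <a@b>   \n" followed by digits on the next line yields 0 rather than picking up those digits, and a stray ">" in the name field does not matter because the search starts from the end of the line.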