Re: [RFC/PATCH 1/3] mailinfo: refactor commit message processing

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 16 Sep 2016 12:12:50 -0700

Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:

> Within the processing of the commit message, check for a scissors line
> or a patchbreak line first (before checking for in-body headers) so that
> a subsequent patch modifying the processing of in-body headers would not
> cause a scissors line or patchbreak line to be misidentified.
>
> If a line could be both an in-body header and a scissors line (for
> example, "From: -- >8 --"), this is considered a fatal error
> (previously, it would be interpreted as an in-body header).

The scissors line is designed to allow garbage other than scissors
and perforation marks to be on the same line, i.e.

        /*
         * The mark must be at least 8 bytes long (e.g. "-- >8 --").
         * Even though there can be arbitrary cruft on the same line
         * (e.g. "cut here"), in order to avoid misidentification, the
         * perforation must occupy more than a third of the visible
         * width of the line, and dashes and scissors must occupy more
         * than half of the perforation.
         */

Even though it is not likely for people to do so, it would probably
be nicer if we can treat

	From: -- >8 -- cut -- >8 -- >8 -- here -- >8 --

as a scissors line instead of making it a fatal error, by treating
that "From:" as just a random garbage.

But this is a minor point.  It is not worth to make it work like so
if the resulting code will become messier.

> The following enumeration shows that processing is the same except (as
> described above) the in-body header + scissors line case.
>
> o in-body header (check_header OK)
>   o passes UTF-8 conversion
>     o [described above] is scissors line
>     o [not possible] is patchbreak line
>     o [not possible] is blank line
>     o is none of the above - processed as header
>   o fails UTF-8 conversion - processed as header
> o not in-body header
>   o passes UTF-8 conversion
>     o is scissors line - processed as scissors
>     o is patchbreak line - processed as patchbreak
>     o is blank line - ignored if in header_stage
>     o is none of the above - log message
>   o fails UTF-8 conversion - input error
>
> As for the result left in "line" (after the invocation of
> handle_commit_msg), it is unused (by its caller, handle_filter, and by
> handle_filter's callers, handle_boundary and handle_body) unless this
> line is a patchbreak line, in which case handle_patch is subsequently
> called (in handle_filter) on "line". In this case, "line" must have
> passed UTF-8 conversion both before and after this patch, so the result
> is still the same overall.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx>
> ---
>  mailinfo.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 115 insertions(+), 30 deletions(-)
>
> diff --git a/mailinfo.c b/mailinfo.c
> index e19abe3..23a56c2 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -340,23 +340,56 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
>  	return out;
>  }
>  
> -static int convert_to_utf8(struct mailinfo *mi,
> -			   struct strbuf *line, const char *charset)
> +/*
> + * Attempts to convert line into UTF-8, storing the result in line.
> + *
> + * This differs from convert_to_utf8 in that conversion non-success is not
> + * considered an error case - mi->input_error is not set, and no error message
> + * is printed.
> + *
> + * If the conversion is unnecessary, returns 0 and stores NULL in old_buf (if
> + * old_buf is not NULL).
> + *
> + * If the conversion is successful, returns 0 and stores the unconverted string
> + * in old_buf and old_len (if they are respectively not NULL).
> + *
> + * If the conversion is unsuccessful, returns -1.
> + */
> +static int try_convert_to_utf8(const struct mailinfo *mi, struct strbuf *line,
> +			       const char *charset, char **old_buf,
> +			       size_t *old_len)
>  {
> -	char *out;
> +	char *utf8;
>  
> -	if (!mi->metainfo_charset || !charset || !*charset)
> +	if (!mi->metainfo_charset || !charset || !*charset ||
> +	    same_encoding(mi->metainfo_charset, charset)) {
> +		if (old_buf)
> +			*old_buf = NULL;
>  		return 0;
> +	}
>  
> -	if (same_encoding(mi->metainfo_charset, charset))
> +	utf8 = reencode_string(line->buf, mi->metainfo_charset, charset);
> +	if (utf8) {
> +		char *temp = strbuf_detach(line, old_len);
> +		if (old_buf)
> +			*old_buf = temp;
> +		strbuf_attach(line, utf8, strlen(utf8), strlen(utf8));
>  		return 0;
> -	out = reencode_string(line->buf, mi->metainfo_charset, charset);
> -	if (!out) {
> +	}
> +	return -1;
> +}
> +
> +/*
> + * Converts line into UTF-8, setting mi->input_error to -1 upon failure.
> + */
> +static int convert_to_utf8(struct mailinfo *mi,
> +			   struct strbuf *line, const char *charset)
> +{
> +	if (try_convert_to_utf8(mi, line, charset, NULL, NULL)) {
>  		mi->input_error = -1;
>  		return error("cannot convert from %s to %s",
>  			     charset, mi->metainfo_charset);
>  	}
> -	strbuf_attach(line, out, strlen(out), strlen(out));
>  	return 0;
>  }

Please split this part into its own patch.  IIUC, it moves the meat
of convert_to_utf8() to a more silent try_convert_to_utf8() and then
makes the former a thin wrapper of the latter.  Which by itself is a
good change but does not have anything to do with "fix handling of
the in-body headers", other than that the main fix wants to have
such a more silent helper for its own use.

> @@ -515,6 +548,13 @@ static int check_header(struct mailinfo *mi,
>  	return ret;
>  }
>  
> +static int check_header_raw(struct mailinfo *mi,
> +			    char *buf, size_t len,
> +			    struct strbuf *hdr_data[], int overwrite) {
> +	const struct strbuf sb = {0, len, buf};
> +	return check_header(mi, &sb, hdr_data, overwrite);
> +}

IIUC, this is a helper for callers that do not have a strbuf but
instead have <buf, len> pair to perform the same check_header() the
callers that have strbuf can do.

As check_header() uses the strbuf as a read-only entity, wrapping
the <buf, len> pair in a temporary strbuf like this is safe.

The incoming <buf> should conceptually be "const char *", but it's
OK.

If check_header() didn't call any helper function that gets passed
&sb as a strbuf, or if convertiong the helper function to take a
<buf, len> pair instead, I would actually suggest refactoring this
the other way around, though.  That is, move the implementation of
check_header() to this function, updating its reference to line->buf
and line->len to reference to <buf> and <len>, and then make
check_header() a thin wrapper that does

        check_header(mi, const struct strbuf *line,
                     struct strbuf *hdr_data[], int overwrite)
        {
                return check_header_raw(mi, line->buf, line->len,
                                        hdr_data, overwrite);
        }

I didn't check how involved to update cmp_header() to take <buf,len>
pair.  If it does not look too bad, then I think I would prefer to
do it that way, and as before, make that conversion a separate
preparatory patch.

> @@ -623,32 +663,48 @@ static int is_scissors_line(const struct strbuf *line)
>  		gap * 2 < perforation);
>  }
>  
> -static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
> +static int resembles_rfc2822_header(const struct strbuf *line)
>  {
> -	assert(!mi->filter_stage);
> +	char *c;
>  
> -	if (mi->header_stage) {
> -		if (!line->len || (line->len == 1 && line->buf[0] == '\n'))
> +	if (!isalpha(line->buf[0]))
> +		return 0;
> +
> +	for (c = line->buf + 1; *c != 0; c++) {
> +		if (*c == ':')
> +			return 1;
> +		else if (*c != '-' && !isalpha(*c))
>  			return 0;
>  	}
> +	return 0;
> +}

Is this helper supposed to handle any rfc2822 looking header, or
only the ones we expect to see as in-body header?

> -	if (mi->use_inbody_headers && mi->header_stage) {
> -		mi->header_stage = check_header(mi, line, mi->s_hdr_data, 0);
> -		if (mi->header_stage)
> -			return 0;
> -	} else
> -		/* Only trim the first (blank) line of the commit message
> -		 * when ignoring in-body headers.
> -		 */
> -		mi->header_stage = 0;
> +static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
> +{
> +	int ret = 0;
> +	int utf8_result;
> +	char *old_buf;
> +	size_t old_len;
> +
> +	assert(!mi->filter_stage);
>  
> -	/* normalize the log message to UTF-8. */
> -	if (convert_to_utf8(mi, line, mi->charset.buf))
> -		return 0; /* mi->input_error already set */
> +	/*
> +	 * Obtain UTF8 for scissors line and patchbreak checks, but retain the
> +	 * undecoded line in case we need to process it as an in-body header.
> +	 */
> +	utf8_result = try_convert_to_utf8(mi, line, mi->charset.buf, &old_buf,
> +					  &old_len);

Just a minor style suggestion.  As <old_buf, old_len> come in a
pair, fold the line before them, so that the readers can easily
see the association between them.  I.e.

        utf8_result = try_convert_to_utf8(mi, line, mi->charset.buf,
                                          &old_buf, &old_len);

> -	if (mi->use_scissors && is_scissors_line(line)) {
> +	if (!utf8_result && mi->use_scissors && is_scissors_line(line)) {
>  		int i;
>  
> +		if (resembles_rfc2822_header(line))
> +			/*
> +			 * Explicitly reject scissor lines that resemble a RFC
> +			 * 2822 header, to avoid being prone to error.
> +			 */
> +			die("scissors line resembles RFC 2822 header");
> +

I guess "disambiguate to favor scissors" is not that difficult ;-)

>  		strbuf_setlen(&mi->log_message, 0);
>  		mi->header_stage = 1;
>  
> @@ -661,18 +717,47 @@ static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
>  				strbuf_release(mi->s_hdr_data[i]);
>  			mi->s_hdr_data[i] = NULL;
>  		}
> -		return 0;
> +		goto handle_commit_msg_out;
>  	}
> -
> -	if (patchbreak(line)) {
> +	if (!utf8_result && patchbreak(line)) {
>  		if (mi->message_id)
>  			strbuf_addf(&mi->log_message,
>  				    "Message-Id: %s\n", mi->message_id);
> -		return 1;
> +		ret = 1;
> +		goto handle_commit_msg_out;
>  	}
>  
> +	if (mi->header_stage) {
> +		char *buf = old_buf ? old_buf : line->buf;
> +		if (buf[0] == 0 || (buf[0] == '\n' && buf[1] == 0))
> +			goto handle_commit_msg_out;
> +	}
> +
> +	if (mi->use_inbody_headers && mi->header_stage) {
> +		char *buf = old_buf ? old_buf : line->buf;
> +		size_t len = old_buf ? old_len : line->len;
> +		mi->header_stage = check_header_raw(mi, buf, len,
> +						    mi->s_hdr_data, 0);

Just a minor comment, but I guess check_header_raw() refactoring is
not strictly needed after all, as this callsite can wrap <buf,len>
into a temporary strbuf.

Unlike the real header that is read in read_one_header_line() inside
a loop to implement line folding before check_header() is called, we
call check_header() before possibly-foldable header lines is fully
assembled into one header.  Probably it comes in later patches, I
guess.

It is not immediately obvious to me how this step helps further work
done by later patches in the series until I read them, but so far
what this patch did looks understandable to me ;-)

Thanks.

> +		if (mi->header_stage)
> +			goto handle_commit_msg_out;
> +	} else
> +		/* Only trim the first (blank) line of the commit message
> +		 * when ignoring in-body headers.
> +		 */
> +		mi->header_stage = 0;
> +
> +	/* If adding as a log message, conversion to UTF-8 is required. */
> +	if (utf8_result) {
> +		mi->input_error = -1;
> +		error("cannot convert from %s to %s",
> +		      mi->charset.buf, mi->metainfo_charset);
> +		goto handle_commit_msg_out;
> +	}
>  	strbuf_addbuf(&mi->log_message, line);
> -	return 0;
> +
> +handle_commit_msg_out:
> +	free(old_buf);
> +	return ret;
>  }
>  
>  static void handle_patch(struct mailinfo *mi, const struct strbuf *line)