Re: [PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields

Jeff King <peff@xxxxxxxx> · Mon, 19 Sep 2016 21:28:33 -0700

On Mon, Sep 19, 2016 at 08:54:40PM +0200, Kevin Daudt wrote:

> diff --git a/t/t5100/comment.expect b/t/t5100/comment.expect
> new file mode 100644
> index 0000000..1197e76
> --- /dev/null
> +++ b/t/t5100/comment.expect
> @@ -0,0 +1,5 @@
> +Author: A U Thor (this is a comment (really))

Hmm. I don't see any recursion in your parsing, so after the first ")"
our escape_context would be 0 again, right? So a more tricky test is:

  Author: A U Thor (this is a comment (really) with \(quoted\) pairs)

We are still inside "ctext" when we hit those quoted pairs, and they
should be unquoted, but your code would not do so (unless we go the
route of simply unquoting pairs everywhere).

I think your parser would have to follow the BNF more closely with a
recursive descent parser, like:

  const char *parse_comment(const char *in, struct strbuf *out)
  {
        size_t orig_out = out->len;

        if ((in = parse_char('(', in, out)) &&
            (in = parse_ccontent(in, out)) &&
            (in = parse_char(')', in, out)))
                return in;

        strbuf_setlen(out, orig_out);
        return NULL;
  }

  const char *parse_ccontent(const char *in, struct strbuf *out)
  {
        while (*in && *in != ')') {
                const char *next;

                if ((next = parse_quoted_pair(in, out)) ||
                    (next = parse_comment(in, out)) ||
                    (next = parse_ctext(in, out))) {
                        in = next;
                        continue;
                }

                /* nothing matched; bail out rather than loop forever */
                break;
        }

        /*
         * if "in" is NUL here we have an unclosed comment; but we'll
         * just silently ignore and accept it
         */
        return in;
  }

  const char *parse_char(char c, const char *in, struct strbuf *out)
  {
        if (*in != c)
                return NULL;
        strbuf_addch(out, c);
        return in + 1;
  }

You can probably guess at the implementation of parse_quoted_pair(),
parse_ctext(), etc (and naturally, the above is completely untested and
probably has some bugs in it).
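
To make that guess a bit more concrete, here is one way the two leaf
helpers might look. This is only a sketch under a few assumptions:
"struct outbuf" and outbuf_addch() are made-up stand-ins for strbuf so
the snippet compiles on its own, and parse_quoted_pair() drops the
backslash so that the output is already de-quoted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for git's struct strbuf (fixed size, no growth). */
struct outbuf {
	char buf[256];
	size_t len;
};

static void outbuf_addch(struct outbuf *out, char c)
{
	/* sketch only: a real strbuf grows on demand instead of asserting */
	assert(out->len + 1 < sizeof(out->buf));
	out->buf[out->len++] = c;
	out->buf[out->len] = '\0';
}

/* quoted-pair: a backslash followed by one character; emit only the latter */
static const char *parse_quoted_pair(const char *in, struct outbuf *out)
{
	if (in[0] != '\\' || !in[1])
		return NULL;
	outbuf_addch(out, in[1]);
	return in + 2;
}

/* ctext: any single character except "(", ")" and backslash */
static const char *parse_ctext(const char *in, struct outbuf *out)
{
	if (!*in || *in == '(' || *in == ')' || *in == '\\')
		return NULL;
	outbuf_addch(out, *in);
	return in + 1;
}
```

Both follow the same convention as parse_char() above: return the new
input position on a match, or NULL without touching "out" otherwise.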

In a former life (back when it was still rfc822!) I remember
implementing a similar parser, which I think was in turn based on the
c-client code in pine. It's not _too_ hard to get it all right based on
the BNF in the RFC, but as you can see it's a bit tedious. And I'm not
convinced we actually need it to be completely right for our purposes.
We really are looking for a single address, with the email in "<>" and
the name as everything before that, but de-quoted.
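
As a rough illustration of that looser approach, something like the
following might be enough. It assumes the first "<" on the line starts
the address (so a "<" inside a quoted string or comment would confuse
it), and extract_name() is a made-up helper, not anything that exists
in mailinfo.c.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy everything before the first '<' into
 * "name", dropping the backslash of each quoted-pair and any trailing
 * spaces. Returns 0 on success, -1 if there is no '<' or "name" is
 * too small.
 */
static int extract_name(const char *line, char *name, size_t size)
{
	const char *lt = strchr(line, '<');
	size_t len = 0;

	if (!lt || !size)
		return -1;

	for (; line < lt; line++) {
		if (*line == '\\' && line + 1 < lt)
			line++; /* skip the backslash, keep the escaped char */
		if (len + 1 >= size)
			return -1;
		name[len++] = *line;
	}

	/* trim the spaces that separated the name from the '<' */
	while (len && name[len - 1] == ' ')
		len--;
	name[len] = '\0';
	return 0;
}
```

That unquotes pairs everywhere rather than tracking whether we are
inside a comment or quoted string, which is exactly the shortcut
mentioned earlier.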

-Peff