Re: [PATCH] Fix Q-encoded multi-octet-char split in email.

Jeff King <peff@xxxxxxxx> · Tue, 3 Jul 2012 02:35:11 -0400

On Tue, Jul 03, 2012 at 10:41:37AM +0900, katsu wrote:

> Issue: Email subjects written in a multi-octet language like Japanese
> cannot be displayed correctly in the destination's email client, because
> a Q-encoded subject longer than 78 octets is split at an octet boundary,
> not at a character boundary, at line breaks.
> e.g.)
>    "=?utf-8?q? [PATCH] ... =E8=83=86=E8=81=A9?="
>                     |
>                     V
>    "=?utf-8?q? [PATCH] ... =E8=83=86=E8?="
>    "=?utf-8?q?=81=A9?="
> 
> Changes: Add a check for whether an octet is part of a UTF-8 multi-octet
> character, and split at character boundaries rather than octet boundaries
> at line breaks in function add_rfc2047() in pretty.c, as follows.
> 
>    "=?utf-8?q? [PATCH] ... =E8=83=86?="
>    "=?utf-8?q?=E8=81=A9?="
> 
> Signed-off-by: Takeharu Katsuyama <tkatsu.ne@xxxxxxxxx>

Yeah, we definitely don't handle that properly according to RFC 2047,
which says a multi-octet character may not be split across adjacent
encoded-words. This patch is going in the right direction, but I have a
few comments:

> --- a/pretty.c
> +++ b/pretty.c
> @@ -272,6 +272,12 @@ static void add_rfc2047(struct strbuf *sb, const char *line, int len,
>  	static const int max_length = 78; /* per rfc2822 */
>  	int i;
>  	int line_len;
> +	int utf_ctr, use_utf;
> +
> +	if (!strcmp(encoding, "UTF-8") || !strcmp(encoding, "utf-8"))
> +		use_utf = 1;
> +	else
> +		use_utf = 0;

Please use is_encoding_utf8, which handles both of these spellings, as
well as "utf8" and "UTF8" (it also handles encoding==NULL; I don't think
that can happen in this code path, but it is nice to be defensive).
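
To illustrate the kind of check meant here, a minimal standalone sketch
follows. It is not git's implementation: the name is_utf8_name is
hypothetical, and treating a NULL encoding as UTF-8 is an assumption made
for the example.

```c
#include <stddef.h>
#include <strings.h>	/* strcasecmp (POSIX) */

/*
 * Hypothetical stand-in for is_encoding_utf8(); not git's code.
 * Accepts "utf-8" and "utf8" in any letter case.
 */
static int is_utf8_name(const char *name)
{
	if (!name)
		return 1;	/* assumption: no encoding given means UTF-8 */
	return !strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8");
}
```

With a helper like this, the four-line if/else in the hunk above would
reduce to a single assignment.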

> @@ -293,10 +299,31 @@ needquote:
>  	strbuf_grow(sb, len * 3 + strlen(encoding) + 100);
>  	strbuf_addf(sb, "=?%s?q?", encoding);
>  	line_len += strlen(encoding) + 5; /* 5 for =??q? */
> +	utf_ctr = 0;
>  	for (i = 0; i < len; i++) {
>  		unsigned ch = line[i] & 0xFF;
>  
> -		if (line_len >= max_length - 2) {
> +		/*
> +		 * Judge if it is an utf-8 char, to avoid inserting newline
> +		 * in the middle of utf-8 char code.
> +		 */
> +		if (use_utf) {
> +			if (ch >= 0xC2 && ch <= 0xDF)	/* 1'st byte of 2-bytes utf-8 */
> +				utf_ctr = 1;
> +			else if (ch >= 0xE0 && ch <= 0xEF)	/*  3-bytes utf-8 */
> +				utf_ctr = 2;
> +			else if (ch >= 0xF0 && ch <= 0xF7)	/*  4-bytes utf-8 */
> +				utf_ctr = 3;
> +			else if (ch >= 0xF8 && ch <= 0xFB)	/*  5-bytes utf-8 */
> +				utf_ctr = 4;
> +			else if (ch >= 0xFC && ch <= 0xFD)	/*  6-bytes utf-8 */
> +				utf_ctr = 5;
> +			else if (ch >= 0x80 && ch <= 0xBF)  /* 2'nd to 6'th byte of utf-8 */
> +				utf_ctr--;
> +			else
> +				utf_ctr = 0;
> +		}
> +		if (line_len >= (max_length - 2 - utf_ctr *3)) {

Can we re-use utf8_width here instead of rewriting these rules?
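
As a rough sketch of the alternative shape I have in mind: look at one
whole character at a time and only emit it if all of its "=XX" triplets
fit on the current line. The helpers below are hypothetical and
standalone (they are not utf8_width and do not reproduce its interface),
and they assume every octet is Q-encoded as three columns. Note that
UTF-8 as defined by RFC 3629 stops at 4-byte sequences, so the 5- and
6-byte cases in the patch should not be needed.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not git's utf8_width): byte length of the UTF-8
 * sequence starting at s, judged from the lead byte only. Invalid or
 * truncated input is treated as a single byte.
 */
static size_t utf8_seq_len(const unsigned char *s, size_t remaining)
{
	size_t n;

	if (!remaining)
		return 0;
	if (s[0] < 0x80)
		n = 1;
	else if (s[0] >= 0xC2 && s[0] <= 0xDF)
		n = 2;
	else if (s[0] >= 0xE0 && s[0] <= 0xEF)
		n = 3;
	else if (s[0] >= 0xF0 && s[0] <= 0xF4)
		n = 4;
	else
		n = 1;	/* stray continuation or invalid lead byte */
	return n <= remaining ? n : 1;
}

/*
 * Count how many leading bytes of buf fit in `room` output columns when
 * every byte is Q-encoded as "=XX" (3 columns), never cutting a character.
 */
static size_t q_bytes_that_fit(const unsigned char *buf, size_t len,
			       size_t room)
{
	size_t used = 0, i = 0;

	while (i < len) {
		size_t n = utf8_seq_len(buf + i, len - i);

		if (used + 3 * n > room)
			break;	/* the whole character moves to the next line */
		used += 3 * n;
		i += n;
	}
	return i;
}
```

For the two characters from the commit message (E8 83 86, E8 81 A9) and
12 columns of room, this stops after 3 bytes, where an octet-based split
would have taken 4 and cut the second character.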

-Peff