"ZHANG, Le" <r0bertz@xxxxxxxxxx> writes:

> And the guess_charset() function does not do exactly what
> its name says.

Really?  See below.

> @@ -483,32 +497,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
>   * Otherwise, we default to assuming it is Latin1 for historical
>   * reasons.
>   */
> -static const char *guess_charset(const struct strbuf *line, const char *target_charset)
> +static void guess_and_convert_to(struct strbuf *line, const char *to_charset)
>  {
> +	if (is_encoding_utf8(to_charset)) {
>  		if (is_utf8(line->buf))
>  			return;
>  	}
>
> +	convert_to(line, to_charset, "ISO8859-1");

Broken indent.  I have to wonder if this helper should be inlined
into its single caller, i.e.

	if (metainfo_charset) {
		if (is_encoding_utf8(metainfo_charset) && is_utf8(it->buf))
			; /* nothing to be done */
		else
			convert_to(it, metainfo_charset, "ISO8859-1");
	}

The decode_header() codepath is the only place where we have to
handle possible binary guck without an explicit "this piece of text
is in that encoding" available in the message, and we guess by "if
the data looks like utf-8, we treat it as such; otherwise we assume
it is 8859-1, which was historically popular".  On all the other
codepaths we should know what encoding the incoming data is in.

Having said all that, I do not think your patch is correct.

 - Does the code with your patch work correctly when the incoming
   data is pre-MIME and does not specify a charset, in which case
   charset.buf may be an empty string?

 - When the incoming data does not say what encoding it is in, our
   intended behaviour is to inspect each line: if it looks like
   utf-8, assume it is utf-8; otherwise assume it is 8859-1.  We
   then convert it to whatever encoding the repository wants, often
   utf-8 (coming from metainfo_charset, suitable for recording
   commit log messages).  We might want to change this heuristic in
   the future, but I do not see a need to do so right now (cf.
   b59d398: Do a better job at guessing unknown character sets,
   2007-07-17).

   I do think guess_and_convert_to() does not implement that
   intended logic correctly.  When we are _not_ encoding to UTF-8,
   we do not even bother to inspect the data to guess if it is
   UTF-8.  Shouldn't it be more like this (modulo "NULL implies
   utf-8")?

	if (is_utf8(it->buf))
		from_charset = "utf-8";
	else
		from_charset = "ISO8859-1";

	if (is_encoding_utf8(metainfo_charset) &&
	    !strcmp(from_charset, "utf-8"))
		; /* nothing to do */
	else if (strcasecmp(to_charset, from_charset))
		convert_to(it, metainfo_charset, from_charset);

 - I don't think the commit message part is handled correctly
   anymore with your patch.  When you want a UTF-8 commit log
   message (metainfo_charset is set to utf-8), and the incoming
   data does not have its charset specified, we should be doing the
   same "guess line-by-line" conversion.  You seem to have lost
   that with this patch, which would be a grave regression.

Please apply the attached patch, which adds a test at the end of
t5100, and make sure the test passes, to prevent that regression
from happening.

Thanks.
Attachment: 1
Description: addition to t5100