"ZHANG, Le" <r0bertz@xxxxxxxxxx> writes:

> And the guess_charset() function does not do exactly what
> its name says.

Really?  See below.

> @@ -483,32 +497,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
>   * Otherwise, we default to assuming it is Latin1 for historical
>   * reasons.
>   */
> -static const char *guess_charset(const struct strbuf *line, const char *target_charset)
> +static void guess_and_convert_to(struct strbuf *line, const char *to_charset)
>  {
> +	if (is_encoding_utf8(to_charset)) {
>  		if (is_utf8(line->buf))
>  			return;
>  	}
>
> +	convert_to(line, to_charset, "ISO8859-1");

Broken indent.  I have to wonder if this helper should be inlined
into its single caller, i.e.

	if (metainfo_charset) {
		if (is_encoding_utf8(metainfo_charset) && is_utf8(it->buf))
			; /* nothing to be done */
		else
			convert_to(it, metainfo_charset, "ISO8859-1");
	}

The decode_header() codepath is the only place where we have to
handle possible binary guck without an explicit "this piece of text
is in that encoding" available in the message, and we guess by "if
the data looks like utf-8, we treat it as such; otherwise we assume
it is 8859-1, which was historically popular".  On all the other
codepaths we should know what encoding the incoming data is in.

Having said all that, I do not think your patch is correct.

 - Does the code with your patch work correctly when the incoming
   data is pre-MIME and does not specify a charset, in which case
   charset.buf may be an empty string?

 - When the incoming data does not say what encoding it is in, our
   intended behaviour is to inspect each line: if it looks like
   utf-8, assume it is utf-8; otherwise assume it is 8859-1.  We
   then convert it to whatever encoding the repository wants, often
   utf-8 (coming from metainfo_charset, suitable for recording
   commit log messages).  We might want to change this heuristic in
   the future, but I do not see a need to do so right now (cf.
   b59d398: Do a better job at guessing unknown character sets,
   2007-07-17).

   I do think guess_and_convert_to() does not implement that
   intended logic correctly.  When we are _not_ encoding to UTF-8,
   we do not even bother to inspect the data to guess if it is
   UTF-8.  Shouldn't it be more like this (modulo "NULL implies
   utf-8")?

	if (is_utf8(it->buf))
		from_charset = "utf-8";
	else
		from_charset = "ISO8859-1";

	if (is_encoding_utf8(metainfo_charset) &&
	    !strcmp(from_charset, "utf-8"))
		; /* nothing to do */
	else if (strcasecmp(to_charset, from_charset))
		convert_to(it, metainfo_charset, from_charset);

 - I don't think the commit message part is handled correctly
   anymore with your patch.  When you want a UTF-8 commit log
   message (metainfo_charset is set to utf-8), and the incoming
   data does not have its charset specified, we should be doing the
   same "guess line-by-line" conversion.  You seem to have lost
   that with this patch, which would be a grave regression.

Please apply the attached patch, which adds a test at the end of
t5100, and make sure the test passes, to prevent that regression
from happening.

Thanks.
Attachment: 1
Description: addition to t5100