Re: [PATCH 1/2] utf8: refactor code to decide fallback encoding

Jeff King <peff@xxxxxxxx> · Tue, 27 Sep 2016 01:52:02 -0400

On Mon, Sep 26, 2016 at 06:22:10PM -0700, Junio C Hamano wrote:

> @@ -501,17 +516,9 @@ char *reencode_string_len(const char *in, int insz,
>  
>  	conv = iconv_open(out_encoding, in_encoding);
>  	if (conv == (iconv_t) -1) {
> -		/*
> -		 * Some platforms do not have the variously spelled variants of
> -		 * UTF-8, so let's fall back to trying the most official
> -		 * spelling. We do so only as a fallback in case the platform
> -		 * does understand the user's spelling, but not our official
> -		 * one.
> -		 */
> -		if (is_encoding_utf8(in_encoding))
> -			in_encoding = "UTF-8";
> -		if (is_encoding_utf8(out_encoding))
> -			out_encoding = "UTF-8";
> +		in_encoding = fallback_encoding(in_encoding);
> +		out_encoding = fallback_encoding(out_encoding);
> +

This comment is interesting. We're concerned about a platform knowing
"utf8" but not "UTF-8". When we fall back, we do it for both the input
and output encodings, because we don't know which may have caused the
problem. So is it possible that we improve one case but break the other?

With just UTF-8, I don't think so. Both knobs only get turned at once
for something like "utf8 -> utf-8", and there both sides become "UTF-8".
So either it improves the situation or not (because the platform either
understands "UTF-8" or it doesn't).

But once we introduce other fallbacks, then "utf8 -> latin1" may become
"UTF-8 -> iso8859-1". A system that knows only "utf8" and "iso8859-1"
_could_ work if we turned the knobs individually, but won't if we turn
them both at once. Worse, a system that knows only "UTF-8" and "latin1"
works now, but would break with your patches.
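
To make those two cases concrete, here is a rough sketch (not git code)
comparing "both knobs at once" with "each knob individually". Everything
in it is made up for illustration: can_open_fn stands in for a successful
iconv_open(), fallback_sketch() is a two-entry guess at what a fallback
table might contain, and platform_a/platform_b simulate the two
theoretical systems described above.

```c
#include <string.h>

/* Hypothetical stand-in for "iconv_open(out, in) succeeded". */
typedef int (*can_open_fn)(const char *in, const char *out);

/* Illustrative mapping only; an assumption, not the table from the patch. */
static const char *fallback_sketch(const char *enc)
{
	if (!strcmp(enc, "utf8"))
		return "UTF-8";
	if (!strcmp(enc, "latin1"))
		return "iso8859-1";
	return enc;
}

/* Both knobs at once: the user's pair, then both fallbacks together. */
static int open_both_at_once(can_open_fn can_open,
			     const char *in, const char *out)
{
	if (can_open(in, out))
		return 1;
	return can_open(fallback_sketch(in), fallback_sketch(out));
}

/* Knobs individually: try all four pairs, the user's spellings first. */
static int open_individually(can_open_fn can_open,
			     const char *in, const char *out)
{
	const char *ins[2], *outs[2];
	int i, j;

	ins[0] = in;
	ins[1] = fallback_sketch(in);
	outs[0] = out;
	outs[1] = fallback_sketch(out);

	for (i = 0; i < 2; i++)
		for (j = 0; j < 2; j++)
			if (can_open(ins[i], outs[j]))
				return 1;
	return 0;
}

static int is_one_of(const char *name, const char *a, const char *b)
{
	return !strcmp(name, a) || !strcmp(name, b);
}

/* Simulated system that knows only "utf8" and "iso8859-1". */
static int platform_a(const char *in, const char *out)
{
	return is_one_of(in, "utf8", "iso8859-1") &&
	       is_one_of(out, "utf8", "iso8859-1");
}

/* Simulated system that knows only "UTF-8" and "latin1". */
static int platform_b(const char *in, const char *out)
{
	return is_one_of(in, "UTF-8", "latin1") &&
	       is_one_of(out, "UTF-8", "latin1");
}
```

For a "utf8 -> latin1" request, open_both_at_once() fails on both
simulated platforms, while open_individually() finds "utf8 -> iso8859-1"
on platform_a and "UTF-8 -> latin1" on platform_b. The cost is up to
four open attempts instead of two.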

I'm not convinced it's worth worrying about, though. The existence of
such a system is theoretical at this point. I'm not even sure how common
the "know about utf8 but not UTF-8" thing is, or if we were merely being
overly cautious.

-Peff