On Sat, Mar 29, 2008 at 10:02:43AM +0100, Robin Rosenberg wrote:

> My proof is entirely empirical. What happens is that attempting to decode a
> non-UTF-8 string will put a unicode surrogate pair into the (now Unicode)
> string and encoding will just encode the surrogate pair into UTF-8 and not
> the original. As a result, the encode(decode($x)) eq $x *only* if $x is a
> valid UTF-8 octet sequence. Why would you not get the original back if
> you start with valid UTF-8?

Because some UTF-8 sequences have multiple representations, and that
information may be lost by whatever intermediate form is the result of
decode($x). In practice, I don't know if this happens or not.

Though it looks like there is an Encode::is_utf8 function (which is also
utf8::is_utf8, but only in perl >= 5.8.1). So we could use that, but it
needs the utf-8 flag turned on for the string. Maybe utf8::valid is
actually what we want.

But there is still a larger question. You have some binary bytes that
will go in a subject header. There are non-ascii bytes. There are
non-utf8 sequences. What do you do?

-Peff
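
PS For illustration, a rough and untested sketch of the round-trip test
Robin describes (assuming perl >= 5.8 with the Encode module; the helper
name is made up here):

    use Encode qw(encode decode);

    # With the default CHECK value, decode() replaces malformed
    # sequences with U+FFFD, so re-encoding yields the original
    # octets only if they were valid UTF-8 to begin with.
    sub looks_like_valid_utf8 {
        my $x = shift;
        return encode('UTF-8', decode('UTF-8', $x)) eq $x;
    }

Alternatively, decode() can be told to die on malformed input by passing
Encode::FB_CROAK and wrapping the call in an eval, which skips the
re-encode step. Either way, this examines the octets themselves, whereas
Encode::is_utf8 only reports whether perl's internal UTF-8 flag is set
on the string.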