On Sat, Mar 29, 2008 at 10:02:43AM +0100, Robin Rosenberg wrote:

> My proof is entirely empirical. What happens is that attempting to decode a
> non-UTF-8 string will put a unicode surrogate pair into the (now Unicode)
> string and encoding will just encode the surrogate pair into UTF-8 and not
> the original. As a result, the encode(decode($x)) eq $x *only* if $x is a
> valid UTF-8 octet sequence. Why would you not get the original back if
> you start with valid UTF-8?

Because some UTF-8 sequences have multiple representations, and that
information may be lost by whatever intermediate form is the result of
decode($x). In practice, I don't know if this happens or not.

Though it looks like there is an Encode::is_utf8 function (which is also
utf8::is_utf8, but only in perl >= 5.8.1). So we could use that, but it
needs the utf-8 flag turned on for the string. Maybe utf8::valid is
actually what we want.

But there is still a larger question. You have some binary bytes that
will go in a subject header. There are non-ascii bytes. There are
non-utf8 sequences. What do you do?

-Peff
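
PS For illustration, a rough and untested sketch of the round-trip test
Robin describes (assuming perl >= 5.8 with the Encode module; the helper
name is made up here):

    use Encode qw(encode decode);

    # With the default CHECK value, decode() replaces malformed
    # sequences with U+FFFD, so re-encoding yields the original
    # octets only if they were valid UTF-8 to begin with.
    sub looks_like_valid_utf8 {
        my $x = shift;
        return encode('UTF-8', decode('UTF-8', $x)) eq $x;
    }

Alternatively, decode() can be told to die on malformed input by passing
Encode::FB_CROAK and wrapping the call in an eval, which skips the
re-encode step. Either way, this examines the octets themselves, whereas
Encode::is_utf8 only reports whether perl's internal UTF-8 flag is set
on the string.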