Jeff King wrote:
> My point is that we don't _know_ what is happening in between the decode
> and encode. Does that intermediate form have the information required to
> convert back to the exact same bytes as the original form?

No, it doesn't. If you want that, save a copy of the string (it's a lazy
copy anyway).

The module that lets you look inside strings to see what is happening is
Devel::Peek. Using that, you can see the state of the UTF8 scalar flag.
For example:

  maia:~$ perl -Mutf8 -MDevel::Peek -le 'Dump "Güt"'
  SV = PV(0x605d08) at 0x62f230
    REFCNT = 1
    FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
    PV = 0x60cd20 "G\303\274t"\0 [UTF8 "G\x{fc}t"]
    CUR = 4
    LEN = 8

By default, strings read from files will NOT have this flag set, unless
the filehandle they were read from was marked as being utf-8 (this
preserves C semantics by default):

  maia:~$ echo "Güt" | perl -MDevel::Peek -nle 'Dump $_'
  SV = PV(0x6052d0) at 0x604220
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x62f0e0 "G\303\274t"\0
    CUR = 4
    LEN = 80

  maia:~$ echo "Güt" | perl -MDevel::Peek -nle 'BEGIN { binmode STDIN, ":utf8" } Dump $_'
  SV = PV(0x6052d0) at 0x604220
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x62f100 "G\303\274t"\0 [UTF8 "G\x{fc}t"]
    CUR = 4
    LEN = 80

> But it still feels a little wrong to test by converting.

utf8::decode works in-place; it essentially checks that the string is
valid UTF-8 and, if so, marks it as UTF8. The flag is only turned on when
the string actually contains multi-byte characters, which is what the
inner test below relies on.

  my ($encoding);
  if (utf8::decode($string)) {
      if (utf8::is_utf8($string)) {
          $encoding = "UTF-8";
      } else {
          $encoding = "US-ASCII";
      }
  } else {
      $encoding = "ISO-8859-1";
  }

For US-ASCII, you'll only have to encode if the string contains special
characters (\037 and below) or any "=" characters.

You could try using langinfo CODESET instead of hardcoding ISO-8859-1 like
that, but at least on my system it can return bizarre values like
ANSI_X3.4-1968, which may in some contexts be a "correct" description of
the encoding, but is unlikely to be understood by mail clients.

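As a rough sketch of the detection described above, wrapped up so it can be
run on its own: the sub names guess_charset and ascii_needs_encoding are
made up for illustration, and falling back to ISO-8859-1 is only an
assumption (the bytes themselves cannot tell you which legacy charset they
are in).

```perl
use strict;
use warnings;

# Hypothetical helper: guess a MIME charset label for a string of raw
# bytes, using the decode-and-check approach.
sub guess_charset {
    my ($bytes) = @_;    # a copy; utf8::decode modifies its argument in-place
    if (utf8::decode($bytes)) {
        # Valid UTF-8. The UTF8 flag is only set if the input contained
        # multi-byte sequences, so pure ASCII input leaves it off.
        return utf8::is_utf8($bytes) ? "UTF-8" : "US-ASCII";
    }
    # Not valid UTF-8: assume a legacy 8-bit charset (an assumption).
    return "ISO-8859-1";
}

# Hypothetical helper: does a US-ASCII value contain control characters
# or "=", and so need encoding anyway?
sub ascii_needs_encoding {
    my ($str) = @_;
    return $str =~ /[\000-\037=]/ ? 1 : 0;
}

print guess_charset("Gut"), "\n";          # plain ASCII bytes
print guess_charset("G\303\274t"), "\n";   # UTF-8 encoded u-umlaut
print guess_charset("G\374t"), "\n";       # Latin-1 encoded u-umlaut
```

Because the sub works on a copy of its argument, the caller's string keeps
its original bytes and flag state, which sidesteps the round-trip worry.
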
> There must be
> some way to ask "is this valid utf-8" (there are several candidate
> functions, but I don't think either of us quite knows the right way to
> invoke them).

I think you were just reading the note on the utf8::valid function a
little too strongly. You could use this block:

  use Encode;

  # Only strings with high-bit bytes can be anything other than ASCII.
  if ($string =~ m/[\200-\377]/) {
      Encode::_utf8_on($string);
      # Not well-formed UTF-8 after all: put the flag back.
      if (!utf8::valid($string)) {
          Encode::_utf8_off($string);
      }
  }

Anyway, I guess all this rubbish is why people use CPAN modules, so that
they don't have to continually rediscover every single protocol quirk and
reinvent the wheel. That is, it would be much, much simpler to use
MIME::Entity->build for all of this, and remove the duplication of code.

Sam.