Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters

Sam Vilain <sam@xxxxxxxxxx> · Sun, 30 Mar 2008 16:40:53 +1300

Jeff King wrote:
> My point is that we don't _know_ what is happening in between the decode
> and encode. Does that intermediate form have the information required to
> convert back to the exact same bytes as the original form?

No, it doesn't.  If you want that, save a copy of the string (it's a
lazy copy anyway).

The module that lets you look inside strings to see what is
happening is Devel::Peek.  Using that, you can see the state of the
UTF8 scalar flag.  For example:

 maia:~$ perl -Mutf8 -MDevel::Peek -le 'Dump "Güt"'
 SV = PV(0x605d08) at 0x62f230
   REFCNT = 1
   FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
   PV = 0x60cd20 "G\303\274t"\0 [UTF8 "G\x{fc}t"]
   CUR = 4
   LEN = 8

By default (in order to preserve C semantics), strings that are read
from files will NOT have this flag set, unless the filehandle they were
read from was marked as being utf-8:

 maia:~$ echo "Güt" | perl -MDevel::Peek -nle 'Dump $_'
 SV = PV(0x6052d0) at 0x604220
   REFCNT = 1
   FLAGS = (POK,pPOK)
   PV = 0x62f0e0 "G\303\274t"\0
   CUR = 4
   LEN = 80
 maia:~$ echo "Güt" | perl -MDevel::Peek -nle 'BEGIN { binmode STDIN, ":utf8" } Dump $_'
 SV = PV(0x6052d0) at 0x604220
   REFCNT = 1
   FLAGS = (POK,pPOK,UTF8)
   PV = 0x62f100 "G\303\274t"\0 [UTF8 "G\x{fc}t"]
   CUR = 4
   LEN = 80
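
If you would rather check the flag from code than read Dump output, here is a minimal sketch of the same experiment. It assumes an in-memory filehandle stands in for STDIN, and read_with_layer is a made-up helper name, not something from the patch. It prints the flag state and the length for each layer.

```perl
use strict;
use warnings;

# The same four bytes that `echo "Güt"` produces, minus the newline.
my $bytes = "G\303\274t";

# Hypothetical helper: read one line from an in-memory file through the
# given PerlIO layer ("" for none) and return it.
sub read_with_layer {
    my ($layer, $data) = @_;
    open(my $fh, "<$layer", \$data) or die "open failed: $!";
    my $line = <$fh>;
    close($fh);
    return $line;
}

for my $layer ("", ":utf8") {
    my $line = read_with_layer($layer, $bytes);
    printf "layer=%-6s flag=%-3s length=%d\n",
        $layer || "(none)",
        utf8::is_utf8($line) ? "on" : "off",
        length($line);
}
```

The length is counted in bytes when the flag is off and in characters when it is on, which is the visible effect of the flag.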

> But it still feels a little wrong to test by converting.

utf8::decode works in-place; it is essentially checking that the string
is valid, and if so, marking it as UTF8.

   my ($encoding);
   if (utf8::decode($string)) {
       if (utf8::is_utf8($string)) {
           $encoding = "UTF-8";
       }
       else {
           $encoding = "US-ASCII";
       }
   }
   else {
       $encoding = "ISO-8859-1";
   }
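
The same test can be wrapped so that it runs on a copy and leaves the caller's bytes alone. This is only a sketch: guess_charset is a made-up name, and the ISO-8859-1 fallback is the same hardcoded guess as above (spelled the way MIME expects it).

```perl
use strict;
use warnings;

# Hypothetical helper: guess a charset label for a raw byte string.
# utf8::decode works in place, so operate on a copy.
sub guess_charset {
    my ($bytes) = @_;
    my $copy = $bytes;
    if (utf8::decode($copy)) {
        # Valid UTF-8.  The flag is only turned on when the string
        # actually contained multi-byte sequences.
        return utf8::is_utf8($copy) ? "UTF-8" : "US-ASCII";
    }
    # Not valid UTF-8: fall back to a hardcoded guess (an assumption).
    return "ISO-8859-1";
}

print guess_charset("G\303\274t"), "\n";   # UTF-8 encoded u-umlaut
print guess_charset("G\374t"), "\n";       # lone 0xFC byte, not valid UTF-8
print guess_charset("Gut"), "\n";          # plain ASCII
```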

For US-ASCII, you'll only have to encode if the string contains special
characters (those below \037) or any "=" characters.
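
As a rough sketch of that check (ascii_needs_quoting is a made-up name, and I am assuming the control range is meant to include \037 itself):

```perl
use strict;
use warnings;

# Hypothetical helper: true if a US-ASCII subject still needs quoting,
# i.e. it contains a control character (\000-\037, range assumed
# inclusive) or an "=" character.
sub ascii_needs_quoting {
    my ($subject) = @_;
    return $subject =~ /[\000-\037=]/ ? 1 : 0;
}

for my $subject ("plain subject", "a=b", "tab\there") {
    printf "%-16s %s\n",
        ascii_needs_quoting($subject) ? "needs quoting:" : "leave as-is:",
        join("", map { $_ eq "\t" ? "\\t" : $_ } split //, $subject);
}
```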

You could try using langinfo CODESET instead of hardcoding ISO8859-1
like that, but at least on my system it can return bizarre values like
ANSI_X3.4-1968, which may in some contexts be a "correct" description of
the encoding, but is unlikely to be understood by mail clients.
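
For what it's worth, a sketch of the langinfo route using the core I18N::Langinfo module. The alias table and the mail_charset name are my own invention for illustration; the only entry maps ANSI_X3.4-1968 to US-ASCII (they are the same registered charset), and the ISO-8859-1 fallback for an empty answer is an assumption.

```perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo CODESET);
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical alias table: codeset names that mail clients are
# unlikely to recognise, mapped to a more common spelling.
my %MAIL_NAME = ("ANSI_X3.4-1968" => "US-ASCII");

# Hypothetical helper: turn a langinfo codeset into a label for mail.
sub mail_charset {
    my ($codeset) = @_;
    # Assumed fallback when the system gives no answer.
    return "ISO-8859-1" unless defined $codeset && length $codeset;
    return exists $MAIL_NAME{$codeset} ? $MAIL_NAME{$codeset} : $codeset;
}

# Pick up the locale from the environment before asking.
setlocale(LC_CTYPE, "");
print mail_charset(langinfo(CODESET)), "\n";
```

What this prints depends entirely on the locale it is run under.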

> There must be
> some way to ask "is this valid utf-8" (there are several candidate
> functions, but I don't think either of us quite knows the right way to
> invoke them).

I think you were just reading the note on the utf8::valid function a
little too strongly.

You could use this block:

   if ($string =~ m/[\200-\377]/) {
       Encode::_utf8_on($string);
       if (!utf8::valid($string)) {
           Encode::_utf8_off($string);
       }
   }
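
If poking at the flag by hand feels too fragile, another way to ask "is this valid utf-8" is to let Encode do a strict decode of a copy and see whether it croaks. Note this is a different technique from the block above, and is_valid_utf8 is a made-up name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# Decodes a copy with FB_CROAK so malformed input raises an exception,
# which eval turns into a false return.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    my $ok = eval {
        Encode::decode("UTF-8", $copy, Encode::FB_CROAK());
        1;
    };
    return $ok ? 1 : 0;
}

print is_valid_utf8("G\303\274t") ? "valid\n" : "invalid\n";
print is_valid_utf8("G\374t")     ? "valid\n" : "invalid\n";
```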

Anyway, I guess all this rubbish is why people use CPAN modules, so that
they don't have to continually rediscover every single protocol quirk
and reinvent the wheel.

That is, it would be much, much simpler to use MIME::Entity->build for
all of this, and remove the duplication of code.
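
MIME::Entity comes from CPAN and may not be installed everywhere, so as a stand-in here is a sketch of the "let a module do it" idea using the MIME-Header encoding that ships with the core Encode module. This is not the MIME::Entity approach itself, and quote_subject is a made-up name. It expects a decoded (character) string.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: RFC 2047 quote a decoded subject string by
# handing it to Encode's MIME-Header encoding.
sub quote_subject {
    my ($chars) = @_;
    return Encode::encode("MIME-Header", $chars);
}

my $subject = quote_subject("G\x{fc}t");
print "Subject: $subject\n";

# Round trip back to characters to confirm nothing was lost.
my $back = Encode::decode("MIME-Header", $subject);
print $back eq "G\x{fc}t" ? "round trip ok\n" : "round trip failed\n";
```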

Sam.