Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters

On Saturday 29 March 2008 09.49.48, Jeff King wrote:
> On Sat, Mar 29, 2008 at 09:41:53AM +0100, Robin Rosenberg wrote:
> > > OK. Do you have an example function that guesses with high probability
> > > whether a string is utf-8? If there are non-ascii characters but we
> > > _don't_ guess utf-8, what should we do?
> >
> > Any test for valid UTF-8 will do that with a very high probability. The
> > Perl UTF-8 "API" is a mess; I couldn't find such a routine. Calling
> > decode/encode and seeing if you get the original string back works, but
> > that is too clumsy, IMHO.
>
> Does that work? I would think you would have to compare the normalized
> versions of each string, since decode(encode($x)) is not, AIUI,
> guaranteed to produce $x.

I don't claim to understand it either. Hopefully some Perl guru will step
forward and just explain how to do this in Perl.

My proof is entirely empirical. What happens is that decoding a
non-UTF-8 string puts a substitution character (U+FFFD, the Unicode
replacement character) into the (now Unicode) string, and encoding then
writes that character out as UTF-8 rather than the original octets. As a
result, encode(decode($x)) eq $x *only* if $x is a valid UTF-8 octet
sequence. Why would you not get the original back if you start with
valid UTF-8?
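
A minimal sketch of that round-trip test, assuming the core Encode module
and the strict 'UTF-8' encoding name. The sub names is_utf8_roundtrip and
is_utf8_strict are made up for illustration and are not part of the patch.
With the default CHECK value, decode() substitutes U+FFFD for each
malformed sequence, so re-encoding cannot reproduce invalid input. The
second sub is a possible alternative that asks decode() to die instead.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Round-trip test: decode the octets, re-encode, compare with the input.
# Malformed sequences are replaced by U+FFFD during decode, so the
# re-encoded octets differ from the original when the input is not UTF-8.
sub is_utf8_roundtrip {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);
    return encode('UTF-8', $chars) eq $octets ? 1 : 0;
}

# Alternative: let decode() croak on the first malformed sequence.
# With a CHECK value set, decode() may modify its argument, so work on
# a copy.
sub is_utf8_strict {
    my ($octets) = @_;
    my $copy = $octets;
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# "caf\xC3\xA9" is UTF-8 for "café"; "caf\xE9" is the Latin-1 encoding.
for my $octets ("caf\xC3\xA9", "caf\xE9") {
    printf "roundtrip=%d strict=%d\n",
        is_utf8_roundtrip($octets), is_utf8_strict($octets);
}
```

As far as I can tell, Encode does no Unicode normalization in either
direction, which would explain why valid UTF-8 comes back byte for byte,
but I have not checked that against every Encode version.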

-- robin