Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters

Robin Rosenberg <robin.rosenberg.lists@xxxxxxxxxx> · Sat, 29 Mar 2008 22:43:40 +0100

On Saturday 29 March 2008 22.18.49, Jeff King wrote:
> On Sat, Mar 29, 2008 at 01:54:10PM +0100, Robin Rosenberg wrote:
> > I think you really should try the UTF-8 guess, since a file may well be
> > UTF-8 even if the user locale is something else. Especially for XML
> > files, UTF-8 is common, but there are many more cases. Look into
> > git-gui/po for more examples. The probability of a UTF-8 test being wrong
> > is just so unimaginably low.
>
> Thinking about this more, I think it is only half the solution. If
> something is not valid utf-8, then we know it must be something else.
> But if something is valid utf-8, is it necessarily utf-8? I think we are
> going to have a much higher probability of guessing wrong there.
>
> For example, consider the bytes { 0xc3, 0xb6 }. In utf-8, they are 'ö'.
> But in iso8859-1, they also have meaning (Ã followed by a paragraph
> symbol). Now that is an unlikely combination to come up. And maybe for
> Latin-1, having two non-ascii characters next to each other is unlikely.
First, that is an unlikely sequence even at random. For any "real" string
it simply won't happen, even in this context. Try scanning everything you
can think of and see if you find such a sequence that is not actually UTF-8.
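
A minimal sketch of the UTF-8 guess discussed above, written in Python rather
than Perl purely for illustration. The helper name looks_like_utf8 is
hypothetical and not part of git-send-email. It shows that the bytes
{ 0xc3, 0xb6 } decode under both encodings, while ordinary Latin-1 text
containing a single accented letter fails a strict UTF-8 check.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The two bytes from the example in the quoted mail.
sample = bytes([0xC3, 0xB6])
as_utf8 = sample.decode("utf-8")      # one character: o with diaeresis
as_latin1 = sample.decode("latin-1")  # two characters: A with tilde, pilcrow

# Typical Latin-1 text: a lone high byte surrounded by ASCII.
latin1_text = "f\u00f6r".encode("latin-1")  # b'f\xf6r'

print(looks_like_utf8(sample))       # valid UTF-8, so the guess says UTF-8
print(looks_like_utf8(latin1_text))  # not valid UTF-8, so fall back
```

A lone high byte followed by ASCII can never be valid UTF-8, which is why
most real Latin-1 text is rejected by this check.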

> But over all commonly used encodings, what is the probability in an
> average text of that encoding that it contains valid UTF-8?
> For example, I have no idea what patterns can be found in EUCJP.

See here http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf

Note that a "random string" there means a randomly generated string, not a
string picked at random from the set of actually existing strings.
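
To make the "randomly generated" case concrete, here is a small Python sketch
(again only an illustration, with a hypothetical function name) that counts
how many pairs of two non-ASCII bytes happen to form valid UTF-8. It models
uniformly random bytes only and says nothing about how often such pairs occur
in real text of any particular encoding.

```python
def count_valid_high_pairs():
    """Count two-byte sequences of high bytes (0x80-0xFF) that are valid UTF-8."""
    total = 0
    valid = 0
    for first in range(0x80, 0x100):
        for second in range(0x80, 0x100):
            total += 1
            try:
                bytes([first, second]).decode("utf-8")
            except UnicodeDecodeError:
                continue
            valid += 1
    return valid, total


valid, total = count_valid_high_pairs()
print(f"{valid} of {total} pairs are valid UTF-8 ({valid / total:.1%})")
```

Only lead bytes 0xC2-0xDF followed by a continuation byte 0x80-0xBF qualify,
so 1920 of the 16384 pairs pass. Each further independent non-ASCII run in a
string must also pass, so the chance of a false positive shrinks quickly with
length.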

> There is some magic with how Perl marks strings as "binary" versus
> "utf-8" that I don't quite understand. And I think is_utf8 is really
> about asking "is the utf-8 flag set".
>
> I think this discussion would benefit greatly from somebody who has more
> of a clue how perl i18n stuff works. Why don't you work up a patch that
> makes sense for you, and then hopefully that will get some attention?

The only real question as I see it is whether perl has a builtin method that
works better than the decode/encode. Anyone?

-- robin