Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters

Jeff King <peff@xxxxxxxx> · Sat, 29 Mar 2008 17:45:16 -0400

On Sat, Mar 29, 2008 at 01:54:47PM +0100, Robin Rosenberg wrote:

> > There were several given in the "OS X normalize your UTF-8 filenames"
> > thread a while back. They generally boil down to "a<UMLAUT MODIFIER>"
> > versus "<A WITH UMLAUT>" both of which are valid UTF-8.
> 
> That is what /OS X/ does with file names. It changes one unicode code point
> to a sequence of other "equivalent" code points. I'm pretty sure perl does
> not do that.

My point is that we don't _know_ what is happening in between the decode
and encode. Does that intermediate form have the information required to
convert back to the exact same bytes as the original form? I don't think
you've provided any evidence that it does or does not.

But here is some evidence that it does work:

$ cat test.pl
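# Round-trip the bytes through decode/encode; only well-formed UTF-8
# should come back byte-for-byte identical.
sub is_valid {
  my $orig = shift;
  my $test = $orig;
  utf8::decode($test);  # bytes -> characters (left untouched if malformed)
  utf8::encode($test);  # characters -> UTF-8 bytes
  return $orig eq $test ? "yes" : "no";
}
print "utf-8: ", is_valid("\xc3\xb6"), "\n";               # precomposed o-umlaut
print "latin-1: ", is_valid("\xc3"), "\n";                 # lone high byte, not UTF-8
print "utf-8 w/ combining: ", is_valid("o\xcc\x88"), "\n"; # o + combining diaeresis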

$ perl test.pl
utf-8: yes
latin-1: no
utf-8 w/ combining: yes
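
As for what the intermediate form looks like: as far as I can tell,
utf8::decode simply maps each well-formed byte sequence to its code
point and does no normalization, so the precomposed and combining
spellings stay distinct. A minimal sketch to peek at it (the helper name
code_points is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: decode UTF-8 bytes and list the resulting code points.
sub code_points {
    my ($bytes) = @_;
    my $chars = $bytes;                  # work on a copy; decode is in-place
    utf8::decode($chars) or return "(not UTF-8)";
    return join " ", map { sprintf("U+%04X", ord($_)) } split //, $chars;
}

print "precomposed: ", code_points("\xc3\xb6"), "\n";   # U+00F6
print "combining:   ", code_points("o\xcc\x88"), "\n";  # U+006F U+0308
```

If that holds, the combining form decodes to two code points and encodes
back to the same three bytes, which matches the "yes" above.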

But it still feels a little wrong to test by converting. There must be
some way to ask "is this valid utf-8" (there are several candidate
functions, but I don't think either of us quite knows the right way to
invoke them).
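
One candidate, as a sketch only: the Encode module's strict "UTF-8"
decoder can be asked to croak on malformed input via Encode::FB_CROAK,
so validity can be checked without comparing a round trip. The function
name is_valid_utf8 is mine, and I have not checked how this behaves on
every perl version we care about:

```perl
use strict;
use warnings;
use Encode ();

# Sketch: return 1 if $bytes is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its input
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

print "utf-8: ",              (is_valid_utf8("\xc3\xb6")  ? "yes" : "no"), "\n";
print "latin-1: ",            (is_valid_utf8("\xc3")      ? "yes" : "no"), "\n";
print "utf-8 w/ combining: ", (is_valid_utf8("o\xcc\x88") ? "yes" : "no"), "\n";
```

This should print the same yes/no/yes as test.pl. The return value of
utf8::decode itself is another option, since it is documented to return
false when the string is not well-formed.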

-Peff