On Sat, Mar 29, 2008 at 01:54:47PM +0100, Robin Rosenberg wrote:

> > There were several given in the "OS X normalize your UTF-8 filenames"
> > thread a while back. They generally boil down to "a<UMLAUT MODIFIER>"
> > versus "<A WITH UMLAUT>" both of which are valid UTF-8.
>
> That is what /OS X/ does with file names. It changes one unicode code point
> to a sequence of other "equivalent" code points. I'm pretty sure perl does
> not do that.

My point is that we don't _know_ what is happening in between the decode
and encode. Does that intermediate form have the information required to
convert back to the exact same bytes as the original form? I don't think
you've provided any evidence that it does or does not.

But here is some evidence that it does work:

  $ cat test.pl
  sub is_valid {
    my $orig = shift;
    my $test = $orig;
    utf8::decode($test);
    utf8::encode($test);
    return $orig eq $test ? "yes" : "no";
  }
  print "utf-8: ", is_valid("\xc3\xb6"), "\n";
  print "latin-1: ", is_valid("\xc3"), "\n";
  print "utf-8 w/ combining: ", is_valid("o\xcc\x88"), "\n";

  $ perl test.pl
  utf-8: yes
  latin-1: no
  utf-8 w/ combining: yes

But it still feels a little wrong to test by converting. There must be
some way to ask "is this valid utf-8" (there are several candidate
functions, but I don't think either of us quite knows the right way to
invoke them).

-Peff
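
One possible way to ask "is this valid utf-8" directly, instead of
comparing the result of a decode/encode round trip, is a strict decode
that fails on malformed input. What follows is only a minimal sketch,
assuming the core Encode module is available; is_valid_utf8 is a
hypothetical helper name for illustration, not something proposed in
this thread, and this may not be the best of the candidate functions.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns "yes" if $bytes is well-formed UTF-8.
# 'UTF-8' (with the hyphen) selects Encode's strict decoder.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# decode() from modifying the buffer it is given.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? "yes" : "no";
}

# Same three inputs as test.pl above.
print "utf-8: ", is_valid_utf8("\xc3\xb6"), "\n";
print "latin-1: ", is_valid_utf8("\xc3"), "\n";
print "utf-8 w/ combining: ", is_valid_utf8("o\xcc\x88"), "\n";
```

This only checks well-formedness. It says nothing about normalization,
so the precomposed and combining forms both pass, and it does not
compare bytes at all.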