On Sat, Mar 29, 2008 at 01:54:10PM +0100, Robin Rosenberg wrote:

> I think you really should try the UTF-8 guess, since a file may well be UTF-8
> even if the user locale is something else. Especially for XML files, UTF-8
> is common, but there are many more cases. Look into git-gui/po for more
> examples. The probability of a UTF-8 test being wrong is just so unimaginable
> low.

Thinking about this more, I think it is only half the solution. If
something is not valid utf-8, then we know it must be something else.
But if something is valid utf-8, is it necessarily utf-8? I think we are
going to have a much higher probability of guessing wrong there.

For example, consider the bytes { 0xc3, 0xb6 }. In utf-8, they are 'ö'.
But in iso8859-1, they also have meaning (Ã followed by the paragraph
symbol, ¶). Now that is an unlikely combination to come up. And maybe
for Latin-1, having two non-ascii characters next to each other is
unlikely. But over all commonly used encodings, what is the probability
that an average text in that encoding happens to be valid UTF-8? For
example, I have no idea what patterns can be found in EUC-JP.

> > PS Your 'require' is more simply written as 'use I18N::Langinfo
> > qw(langinfo CODESET)', or perhaps even simpler:
>
> See the man page, from which I stole it. It suggests you wrap it all inside
> eval {}, just in case your perl does not have langinfo.

Yes, that does make sense for a script (I just couldn't see it because
the entire toy example would be inside the eval).

> As for the is_utf8() i'm not sure what it does, but I can't make it work.

There is some magic with how Perl marks strings as "binary" versus
"utf-8" that I don't quite understand. And I think is_utf8 is really
about asking "is the utf-8 flag set", not "are these bytes valid utf-8".

I think this discussion would benefit greatly from somebody who has more
of a clue how perl i18n stuff works. Why don't you work up a patch that
makes sense for you, and then hopefully that will get some attention?

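
To make the { 0xc3, 0xb6 } case above concrete, here is a rough sketch of
the "try UTF-8 first, otherwise fall back" guess. It is in Python rather
than Perl only because the byte-level behavior is easy to check there. The
helper name guess_decode and the iso-8859-1 fallback are my own
assumptions for illustration, not anything from the patch under discussion.

```python
def guess_decode(data: bytes, fallback: str = "iso-8859-1"):
    """Hypothetical helper: strict UTF-8 first, fallback encoding otherwise.

    Returns a (text, encoding_used) pair.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so it "must be something else"; assume the fallback.
        return data.decode(fallback), fallback


# The two bytes from the example: valid in both encodings, different meaning.
ambiguous = bytes([0xC3, 0xB6])
as_utf8 = ambiguous.decode("utf-8")         # one character: o-umlaut
as_latin1 = ambiguous.decode("iso-8859-1")  # two characters: A-tilde, pilcrow

# A lone 0xf6 is o-umlaut in Latin-1 but is not valid UTF-8 on its own.
latin1_only = guess_decode(bytes([0xF6]))

# The ambiguous pair is always reported as UTF-8, right or wrong.
ambiguous_guess = guess_decode(ambiguous)
```

The first half of the guess is solid: the lone 0xf6 is rejected as UTF-8
and falls through to the fallback. The second half is the worrying one:
the ambiguous pair always comes back as UTF-8, and nothing in the bytes
themselves tells us whether that was the right call.
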
-Peff