Robin Rosenberg <robin.rosenberg.lists@xxxxxxxxxx> writes:

> On Saturday 29 March 2008 08.22.03, Jeff King wrote:
>> On Sat, Mar 29, 2008 at 08:19:07AM +0100, Robin Rosenberg wrote:
>> > On Friday 28 March 2008 22.29.01, Jeff King wrote:
>> > > We always use 'utf-8' as the encoding, since we currently
>> > > have no way of getting the information from the user.
>> >
>> > Don't set encoding to UTF-8 unless it actually looks like UTF-8.
>>
>> OK. Do you have an example function that guesses with high probability
>> whether a string is utf-8? If there are non-ascii characters but we
>> _don't_ guess utf-8, what should we do?
>
> Any test for valid UTF-8 will do that with a very high probability. The
> perl UTF-8 "api" is a mess. I couldn't find such a routine!?. Calling
> decode/encode and see if you get the original string works, but that is too
> clumsy, IMHO.

Decoding and then re-encoding tests both whether the string is valid
and whether it is canonically encoded, which is testing too much. You
only want to check whether it is valid; you do not care about
normalization.

I see this in perluniintro.pod:

    =item *

    How Do I Detect Data That's Not Valid In a Particular Encoding?

    Use the C<Encode> package to try converting it.
    For example,

        use Encode 'decode_utf8';
        if (decode_utf8($string_of_bytes_that_I_think_is_utf8)) {
            # valid
        } else {
            # invalid
        }

For commit log messages, we traditionally use a similar idea: we guess
by checking whether the message looks like a UTF-8 encoded string, and
otherwise assume Latin-1 (and I think we still do if the user does not
tell us).

If this issue is only about the --compose part of send-email, perhaps
you can ask interactively instead of falling back to "otherwise assume
Latin-1"?
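
The validity-only check discussed above can be sketched in Perl roughly as follows. This is an illustration under stated assumptions, not code from git-send-email: the helper names is_valid_utf8 and guess_charset are hypothetical, the input is assumed to be a byte string (not already decoded), and the Latin-1 fallback simply mirrors the commit-log heuristic described in the message. Note that decode_utf8() called without a CHECK argument substitutes U+FFFD for malformed bytes rather than failing, so the sketch passes Encode::FB_CROAK to make malformed input raise an error that eval can catch.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of git-send-email): returns 1 if $bytes
# is well-formed UTF-8, 0 otherwise. It checks validity only and does
# not care about normalization.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input instead of
    # substituting a replacement character; LEAVE_SRC keeps decode()
    # from consuming its source argument.
    return eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    } ? 1 : 0;
}

# Hypothetical policy sketch: pure ASCII needs no guess, well-formed
# UTF-8 is labelled as such, and anything else is assumed to be Latin-1.
sub guess_charset {
    my ($bytes) = @_;
    return 'us-ascii'   if $bytes !~ /[\x80-\xff]/;
    return 'UTF-8'      if is_valid_utf8($bytes);
    return 'ISO-8859-1';
}
```

With these definitions, guess_charset("caf\xc3\xa9") returns 'UTF-8', while guess_charset("caf\xe9") returns 'ISO-8859-1'. For the --compose case, the final fallback is where an interactive prompt could replace the Latin-1 assumption.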