Hello Johannes,

Johannes Schindelin wrote:
> The problem is: you cannot easily recognize if it is UTF8 or not,
> programatically. There is a good indicator _against_ UTF8, namely the
> first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there
> is no _positive_ sign that it is UTF8. For example, many umlauts and other
> special modifications to letters, stay in the range 0x7f-0xff.

That's not the only indication. Here is a (Python) function that checks
whether the byte string s is correctly UTF-8 encoded:

    def is_utf8_str(s):
        cnt_furtherbytes = 0
        # bytearray() yields integer byte values on Python 2 and Python 3
        for c in bytearray(s):
            if cnt_furtherbytes > 0:
                # inside a multi-byte sequence: expect 10xxxxxx
                if c & 0xc0 == 0x80:
                    cnt_furtherbytes -= 1
                else:
                    return False
            else:
                if c < 0x80:
                    continue
                elif c < 0xc0:
                    # continuation byte without a start byte
                    return False
                elif c < 0xe0:
                    cnt_furtherbytes = 1
                elif c < 0xf0:
                    cnt_furtherbytes = 2
                elif c < 0xf8:
                    cnt_furtherbytes = 3
                elif c < 0xfc:
                    cnt_furtherbytes = 4
                elif c < 0xfe:
                    cnt_furtherbytes = 5
                else:
                    return False
        # a sequence cut off at the end of the string is invalid, too
        return cnt_furtherbytes == 0

A UTF-8 character is either one byte long with the msb 0, or a sequence
starting with a value between 0xc0 and 0xfd (inclusive) followed by,
depending on that first value, up to five further bytes in the range 0x80
to 0xbf.

You could even be more strict by checking for Unicode 3.1 conformance
(i.e. a character has to be encoded in its shortest form).

Look at utf8(7) for further details. (This manpage is included in the
Debian manpages package.)

Best regards
Uwe

-- 
Uwe Kleine-König

http://www.google.com/search?q=5+choose+3
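The stricter shortest-form check mentioned above could be sketched roughly
as follows. This is only an illustration for Python 3, not part of the
original function: the name is_strict_utf8 is made up, and it assumes the
modern limits (sequences of at most four bytes, no surrogates, nothing
above U+10FFFF) rather than the six-byte scheme described in utf8(7).

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 in shortest form.

    Assumption: modern limits apply (at most 4 bytes per character,
    surrogates and code points above U+10FFFF are rejected).
    """
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            # plain ASCII, one byte
            i += 1
            continue
        elif 0xC2 <= lead < 0xE0:
            need, min_cp, cp = 1, 0x80, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            need, min_cp, cp = 2, 0x800, lead & 0x0F
        elif 0xF0 <= lead < 0xF5:
            need, min_cp, cp = 3, 0x10000, lead & 0x07
        else:
            # stray continuation byte, 0xC0/0xC1 (always overlong),
            # or a start byte beyond the assumed four-byte limit
            return False
        if i + need >= n:
            return False  # sequence is cut off at the end of the input
        for k in range(1, need + 1):
            b = data[i + k]
            if b & 0xC0 != 0x80:
                return False  # not a continuation byte (10xxxxxx)
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_cp:
            return False  # overlong: not encoded in its shortest form
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            return False  # surrogate or out of range
        i += need + 1
    return True
```

For example, is_strict_utf8("Köln".encode("utf-8")) is accepted, while the
Latin-1 bytes b"K\xf6ln" and the overlong slash b"\xc0\xaf" are rejected.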