Re: specify charset for commits

Hello Johannes,

Johannes Schindelin wrote:
> The problem is: you cannot easily recognize if it is UTF8 or not, 
> programmatically. There is a good indicator _against_ UTF8, namely the 
> first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there 
> is no _positive_ sign that it is UTF8. For example, many umlauts and other 
> special modifications to letters, stay in the range 0x7f-0xff.
That's not the only indication.  Here is a (Python) function that
checks whether a byte string s is correctly UTF-8 encoded:

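	def is_utf8_str(s):
	  """Return True if the byte string s is well-formed UTF-8."""
	  cnt_furtherbytes = 0  # continuation bytes still expected
	  for c in bytearray(s):  # iterate over the byte values as ints
	    if cnt_furtherbytes > 0:
	      # inside a multi-byte sequence: must be 10xxxxxx
	      if (c & 0xc0) == 0x80:
	        cnt_furtherbytes -= 1
	      else:
	        return False
	    else:
	      if c < 0x80:
	        continue  # plain ASCII
	      elif c < 0xc0:
	        return False  # continuation byte without a lead byte
	      elif c < 0xe0:
	        cnt_furtherbytes = 1
	      elif c < 0xf0:
	        cnt_furtherbytes = 2
	      elif c < 0xf8:
	        cnt_furtherbytes = 3
	      elif c < 0xfc:
	        cnt_furtherbytes = 4
	      elif c < 0xfe:
	        cnt_furtherbytes = 5
	      else:
	        return False  # 0xfe and 0xff never occur in UTF-8
	  # a sequence cut off at the end of the string is invalid, too
	  return cnt_furtherbytes == 0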

A UTF-8 character is either a single byte with the msb 0, or a sequence
starting with a byte between 0xc0 and 0xfd (inclusive) followed, depending
on that first byte, by one to five further bytes in the range 0x80 to 0xbf.
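
To illustrate why this check is a useful indicator in practice, here is
a small sketch that encodes the same word once as UTF-8 and once as
Latin-1.  It uses Python's built-in strict decoder as a stand-in for the
function above (an assumption made to keep the example self-contained;
the built-in decoder is stricter, as it also rejects the five- and
six-byte forms and non-shortest encodings).  The helper name
decodes_as_utf8 is made up for this example.

```python
def decodes_as_utf8(data):
    """Return True if the byte string passes Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

name = u"K\u00f6nig"                # the letter o-umlaut is U+00F6
as_utf8 = name.encode("utf-8")      # b'K\xc3\xb6nig'
as_latin1 = name.encode("latin-1")  # b'K\xf6nig'

print(decodes_as_utf8(as_utf8))     # True
print(decodes_as_utf8(as_latin1))   # False
```

In the Latin-1 version the umlaut is the single byte 0xf6 followed by an
ASCII letter, which is not a continuation byte, so the text cannot be
UTF-8.  The reverse does not hold strictly: the bytes 0xc3 0xb6 are also
valid Latin-1 (two characters), so passing the check is strong evidence
rather than proof.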

You could even be stricter by checking for Unicode 3.1 conformance
(i.e. a character has to be encoded in its shortest form).
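
A minimal sketch of such a shortest-form check could look like the
following.  It assumes it is handed exactly one sequence (lead byte plus
continuation bytes, one to six bytes in total) that has already passed
the structural check; the names MIN_CODE_POINT and is_shortest_form are
invented for this example.

```python
# Smallest code point that really needs a sequence of the given total
# length (1 to 6 bytes, following the layout described in utf8(7)).
MIN_CODE_POINT = {1: 0x0, 2: 0x80, 3: 0x800,
                  4: 0x10000, 5: 0x200000, 6: 0x4000000}

def is_shortest_form(seq):
    """Check one structurally valid UTF-8 sequence for overlong encoding."""
    seq = bytearray(seq)
    n = len(seq)
    if n == 1:
        return seq[0] < 0x80
    # The lead byte carries 7 - n payload bits (5 for n=2, 4 for n=3, ...).
    code_point = seq[0] & ((1 << (7 - n)) - 1)
    # Each continuation byte contributes its low six bits.
    for b in seq[1:]:
        code_point = (code_point << 6) | (b & 0x3f)
    return code_point >= MIN_CODE_POINT[n]

print(is_shortest_form(b'/'))         # True
print(is_shortest_form(b'\xc0\xaf'))  # False, overlong encoding of '/'
print(is_shortest_form(b'\xc3\xb6'))  # True, U+00F6
```

The idea is simply to decode the value and compare it against the
smallest value that needs that many bytes; anything below that limit
could have been written in a shorter sequence.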

Look at utf8(7) for further details.  (This manpage is included in the
Debian manpages package.)

Best regards
Uwe

-- 
Uwe Kleine-König

http://www.google.com/search?q=5+choose+3