Re: substr and UTF-8

Michael B Allen <mba2000@xxxxxxxxxx> · Wed, 30 Aug 2006 10:22:13 -0400

On Wed, 30 Aug 2006 10:08:36 -0400
Michael B Allen <mba2000@xxxxxxxxxx> wrote:

> On Wed, 30 Aug 2006 18:34:20 +0700
> "Peter Lauri" <lists@xxxxxxxxxxx> wrote:
> 
> > Hi group,
> > 
> > I want to limit the number of characters that are shown in a script. The
> > characters happen to be Thai, and the page is encoded in UTF-8. Everything
> > works, except when I want to cut the text (just take start of string).
> > 
> > I do:
> > 
> > echo substr($thaistring, 0, 30);
> > 
> > The beginning of the string works fine, but the last character does mostly
> > "break". How can I determine the start and end of a character.
> 
> The last byte of a UTF-8 character does not have bit 8 set whereas all
> preceding bytes do.

Actually this is false. I don't know what I was thinking. The high bit
will be set in every byte of a multi-byte UTF-8 sequence. If it's not
set, the byte is an ASCII character.

The bytes are actually laid out as follows [1]:

U-00000000 - U-0000007F:  0xxxxxxx
U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
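
To make the layout concrete, here is a small sketch that prints the bit
pattern of each byte in a string. The helper name utf8_bits is made up for
this example, and the Thai letter is written as escaped bytes (U+0E01,
which encodes as E0 B8 81) so the snippet does not depend on how the
source file is saved.

```php
<?php
// Hypothetical helper (not a built-in): render each byte of $s as an
// 8-digit binary string, separated by spaces.
function utf8_bits($s) {
    $out = array();
    for ($i = 0; $i < strlen($s); $i++) {
        // ord() gives the byte value; decbin() drops leading zeros,
        // so pad back to 8 digits.
        $out[] = str_pad(decbin(ord($s[$i])), 8, '0', STR_PAD_LEFT);
    }
    return implode(' ', $out);
}

echo utf8_bits("A"), "\n";            // 01000001
echo utf8_bits("\xE0\xB8\x81"), "\n"; // 11100000 10111000 10000001
```

The ASCII byte matches the 0xxxxxxx row, and the Thai letter matches the
three-byte row: one 1110xxxx lead byte followed by two 10xxxxxx bytes.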

So there's no way to tell the last byte of a UTF-8 byte sequence by
looking at that byte alone, but you can tell whether a byte is the first
one by looking at bits 7 and 8 (counting from 1, i.e. masks 0x40 and
0x80). Specifically, if bit 8 is not on, the byte is an ASCII character
and thus the "start" of a new character. Otherwise, if bit 7 is on, it's
the start of a new multi-byte UTF-8 sequence; if bit 7 is off, it's a
continuation byte.

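  // $b is a byte value (0-255), e.g. ord($str[$i]), not a one-char string.
  function is_utf8_start($b) {
      // 0xxxxxxx is ASCII; 11xxxxxx starts a multi-byte sequence.
      return (($b & 0x80) == 0) || (($b & 0x40) != 0);
  }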
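
Applied to the original substr() problem, the same bit test can be used to
back up from the cut point until it lands on a character boundary. The
following is only a sketch: utf8_truncate is a made-up name, it assumes
the input is valid UTF-8, and it limits the length in bytes rather than in
characters.

```php
<?php
// Hypothetical helper: cut $s to at most $max bytes without splitting a
// multi-byte character. Assumes $s is valid UTF-8.
function utf8_truncate($s, $max) {
    if (strlen($s) <= $max) {
        return $s;
    }
    // $s[$max] is the first byte that would be dropped. While it is a
    // continuation byte (10xxxxxx), the cut falls inside a character,
    // so move the cut point back one byte.
    while ($max > 0 && (ord($s[$max]) & 0xC0) == 0x80) {
        $max--;
    }
    return substr($s, 0, $max);
}

// Three Thai letters (U+0E01, U+0E02, U+0E03), 3 bytes each, 9 bytes total.
$thai = "\xE0\xB8\x81\xE0\xB8\x82\xE0\xB8\x83";
echo strlen(utf8_truncate($thai, 4)), "\n"; // 3
```

A plain substr($thai, 0, 4) would keep the first letter plus one stray
lead byte of the second. If the mbstring extension is available,
mb_substr($thaistring, 0, 30, 'UTF-8') counts characters instead of bytes
and is probably the simpler route.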

Mike

[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

-- 
Michael B Allen
PHP Active Directory SSO
http://www.ioplex.com/
