Re: languages and PHP

Colin Guthrie <gmane@xxxxxxxxxxxxxx> · Tue, 02 Oct 2007 11:09:06 +0100

tedd wrote:
> Isn't UTF-8 the big fish here?
> 
> Sure there's UTF-16 and larger, but everything else is a subset of UTF-8,
> is it not?
> 
> So, what's the problem if you get a character defined by ISO -- it's
> still within the UTF-8 super-group, right?

Individual bytes are sometimes OK (plain ASCII is identical in
ISO-8859-x and UTF-8), but it's the sequence of bytes that could be
invalid.

UTF-8 works by using special bits at the MSB end of the first byte to
say, "I can't represent this character in one byte, I need to use 2
bytes (or 3, or 4)". The current spec (RFC 3629) stops at 4 bytes.
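
To make the byte counts concrete, here is a small sketch. It is in
Python rather than PHP, and the sample characters and the helper name
utf8_length are just my own picks for illustration:

```python
# Sample characters chosen for illustration: U+0041, U+00E9, U+20AC, U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Print every byte in binary so the marker bits at the MSB end are visible.
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s): {bits}")
```

The binary dump shows which bits at the top of each byte are markers
and which carry the actual character data.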

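In a multi-byte sequence the MSB end of every byte must follow a
pre-defined scheme: the first byte starts with 110, 1110 or 11110 (for
2, 3 or 4 bytes) and every following byte starts with 10. If the bytes
do not follow that scheme, they are syntactically invalid UTF-8.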
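
A rough sketch of that kind of syntax check, in Python. The function
name is made up for this example, and it only looks at the marker bits;
a real validator also has to reject overlong encodings, surrogates and
code points above U+10FFFF, which this one does not:

```python
def is_structurally_valid_utf8(data: bytes) -> bool:
    """Check only the lead/continuation bit patterns of a byte string."""
    i = 0
    while i < len(data):
        b = data[i]
        if (b & 0x80) == 0x00:      # 0xxxxxxx: one byte on its own (ASCII)
            need = 0
        elif (b & 0xE0) == 0xC0:    # 110xxxxx: first byte of a 2-byte sequence
            need = 1
        elif (b & 0xF0) == 0xE0:    # 1110xxxx: first byte of a 3-byte sequence
            need = 2
        elif (b & 0xF8) == 0xF0:    # 11110xxx: first byte of a 4-byte sequence
            need = 3
        else:                       # 10xxxxxx or 11111xxx cannot start a sequence
            return False
        tail = data[i + 1:i + 1 + need]
        # Every following byte must be present and must look like 10xxxxxx.
        if len(tail) != need or any((c & 0xC0) != 0x80 for c in tail):
            return False
        i += 1 + need
    return True


print(is_structurally_valid_utf8(b"caf\xc3\xa9"))  # UTF-8 bytes for "café"
print(is_structurally_valid_utf8(b"caf\xe9"))      # ISO-8859-1 bytes for "café"
```

The second call fails because 0xE9 announces a 3-byte sequence and
nothing follows it.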

So it's more than just the individual bytes: their order is important
too.
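
To tie this back to the ISO question, here is a short sketch (Python
again, with a helper name I made up) that hands the same word to a
UTF-8 decoder in three forms: encoded as ISO-8859-1, encoded as UTF-8,
and with the two UTF-8 bytes swapped:

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the standard UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "caf\u00e9"                        # "café"
latin1_bytes = word.encode("iso-8859-1")  # b'caf\xe9'
utf8_bytes = word.encode("utf-8")         # b'caf\xc3\xa9'
swapped_bytes = b"caf\xa9\xc3"            # same two bytes, wrong order

for label, data in [("ISO-8859-1", latin1_bytes),
                    ("UTF-8", utf8_bytes),
                    ("swapped", swapped_bytes)]:
    print(label, data, decodes_as_utf8(data))
```

Only the middle one decodes. The character is the same in all three
cases; it's the byte sequence that is or isn't valid UTF-8.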

Hope that explains it (although probably a bad explanation as I'm very
tired right now!).

Col

-- 
PHP General Mailing List (http://www.php.net/)