At 11:09 AM +0100 10/2/07, Colin Guthrie wrote:
tedd wrote:
Isn't UTF-8 the big fish here?
Sure there's UTF-16 and larger, but everything else is a subset of UTF-8,
is it not?
So, what's the problem if you get a character defined by ISO -- it's
still within the UTF-8 super-group, right?
Individual characters are sometimes OK, but it's the sequence of
characters that could be invalid.
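
To make that point concrete, here is a minimal sketch in Python (chosen only for illustration; the word "café" and the helper name is_valid_utf8 are assumptions, not anything from this thread). It shows that the same character yields different bytes in ISO-8859-1 and in UTF-8, and that the ISO-8859-1 bytes are not a valid UTF-8 sequence.

```python
# "é" is the single byte 0xE9 in ISO-8859-1, but two bytes (0xC3 0xA9) in UTF-8.
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = "café".encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

The ISO-8859-1 bytes fail because 0xE9 announces a three-byte sequence and nothing follows it. In PHP, mb_check_encoding($str, 'UTF-8') performs a comparable check.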
UTF-8 works by using special bits at the MSB end of the byte to say, "I
can't represent this character in one byte, I need to use 2 bytes (or 3
bytes)" (and maybe also 4? can't remember off the top of my head).
In a multi-byte sequence the MSB end of all the bytes must follow a
pre-defined scheme. If they do not they are syntactically invalid UTF-8.
So it's more than just individual characters, the order of them is
important.
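
The scheme described above can be sketched roughly as follows, in Python for illustration only. The function names are hypothetical. UTF-8 sequences are one to four bytes long: the lead byte's high bits give the length, and every following byte must start with the bits 10. This sketch checks only that bit layout; a full validator would also reject overlong encodings, surrogates, and code points above U+10FFFF.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length announced by a lead byte, or 0 if the byte cannot start a sequence."""
    if (lead & 0b10000000) == 0b00000000:  # 0xxxxxxx: plain ASCII, 1 byte
        return 1
    if (lead & 0b11100000) == 0b11000000:  # 110xxxxx: 2-byte sequence
        return 2
    if (lead & 0b11110000) == 0b11100000:  # 1110xxxx: 3-byte sequence
        return 3
    if (lead & 0b11111000) == 0b11110000:  # 11110xxx: 4-byte sequence
        return 4
    return 0  # 10xxxxxx (stray continuation byte) or 11111xxx (never used)


def follows_utf8_bit_pattern(data: bytes) -> bool:
    """Check the lead/continuation bit layout only (not a full validator)."""
    i = 0
    while i < len(data):
        length = utf8_sequence_length(data[i])
        if length == 0:
            return False
        tail = data[i + 1 : i + length]
        if len(tail) != length - 1:  # sequence cut short at end of input
            return False
        if any((b & 0b11000000) != 0b10000000 for b in tail):
            return False  # a continuation byte must look like 10xxxxxx
        i += length
    return True


print(follows_utf8_bit_pattern("é".encode("utf-8")))  # bytes C3 A9
print(follows_utf8_bit_pattern(b"\xc3\x28"))          # lead byte, then a non-continuation byte
```

The second call fails even though 0x28 is a perfectly good ASCII character on its own, which is the sense in which the order of the bytes matters.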
Hope that explains it (although probably a bad explanation as I'm very
tired right now!).
Col
Ah, I see what you're saying. I've run into that before when studying
Unicode. The mb_ series of functions deals with larger-than-ASCII
encodings, but I don't know of any that deal with character
sequence/combinations or right/left readings. That's all Greek to me,
pardon the pun.
Cheers,
tedd
--
-------
http://sperling.com http://ancientstones.com http://earthstones.com
--
PHP General Mailing List (http://www.php.net/)