Re: languages and PHP

tedd <tedd@xxxxxxxxxxxx> · Tue, 2 Oct 2007 15:43:17 -0400

At 11:09 AM +0100 10/2/07, Colin Guthrie wrote:
tedd wrote:
 Isn't UTF-8 the big fish here?

 Sure there's UTF-16 and larger, but everything else is a subset of UTF-8,
 is it not?

 So, what's the problem if you get a character defined by ISO -- it's
 still within the UTF-8 super-group, right?

Individual characters are sometimes OK, but it's the sequence of
characters that could be invalid.

UTF-8 works by using special bits at the MSB end of the byte to say, "I
can't represent this character in one byte, I need to use 2 bytes (or 3
bytes)" (and maybe also 4? can't remember off the top of my head).

In a multi-byte sequence the MSB end of all the bytes must follow a
pre-defined scheme. If they do not they are syntactically invalid UTF-8.

So it's more than just individual characters, the order of them is
important.
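
To make the point above concrete, here is a rough sketch (in Python,
purely for illustration; the function names are made up for this
example) of the bit-pattern check described: a lead byte announces how
many bytes the sequence has, and every following byte must start with
the bits 10. It assumes sequences of 1 to 4 bytes and only looks at the
MSB structure, so it does not reject overlong forms or surrogates the
way a strict decoder would.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the length announced by a lead byte, or 0 if it cannot start a sequence."""
    if (lead & 0b1000_0000) == 0b0000_0000:   # 0xxxxxxx: plain ASCII, 1 byte
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:   # 110xxxxx: 2-byte sequence
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:   # 1110xxxx: 3-byte sequence
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:   # 11110xxx: 4-byte sequence
        return 4
    return 0                                  # 10xxxxxx or 11111xxx: not a lead byte


def has_valid_utf8_structure(data: bytes) -> bool:
    """Check only the MSB scheme: each lead byte followed by the right number of 10xxxxxx bytes."""
    i = 0
    while i < len(data):
        length = utf8_sequence_length(data[i])
        if length == 0:
            return False
        tail = data[i + 1 : i + length]
        if len(tail) != length - 1:           # sequence cut off at end of input
            return False
        if any((b & 0b1100_0000) != 0b1000_0000 for b in tail):
            return False                      # continuation byte missing its 10 prefix
        i += length
    return True


if __name__ == "__main__":
    print(has_valid_utf8_structure("café".encode("utf-8")))    # bytes produced as UTF-8
    print(has_valid_utf8_structure("café".encode("latin-1")))  # same text as ISO-8859-1 bytes
```

The second call is the case discussed in this thread: the ISO-8859-1
byte for the accented letter looks like a 3-byte lead byte, but the
continuation bytes it promises never arrive, so the sequence is
rejected.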

Hope that explains it (although probably a bad explanation as I'm very
tired right now!).

Col

Ah, I see what you're saying. I've run into that before when studying 
Unicode. The mb_ series of functions deals with larger-than-ASCII
encodings, but I don't know of any that deal with character
sequences/combinations or right/left readings. That's all Greek to me,
pardon the pun.

Cheers,

tedd

--
-------
http://sperling.com  http://ancientstones.com  http://earthstones.com
