Re: When is "z" != "z" ?

tedd <tedd@xxxxxxxxxxxx> · Tue, 6 Jun 2006 13:05:00 -0400

Rasmus:

At 6:54 PM -0700 6/5/06, Rasmus Lerdorf wrote:
>tedd wrote:
>>For example, the Unicode issue was raised during this discussion -- if php doesn't consider the numeric relationship of characters, then I see a big problem waiting in the wings. Because if we're having these types of discussions with just considering 00-7F characters, then I can only guess at what's going to happen when we start considering 000000-FFFFFF code-points.
>>
>>Now, was that enough said?  :-)
>
>I don't think you really understand this.  < and > are collation operators when they operate on strings.  They have absolutely nothing to do with the numeric values of the characters.

What's to understand? It's the pecking order of strings -- it's the system of how one sorts strings. It's the way I tried to order my books in college. We've been doing it all our lives and now you think I don't understand how to sort stuff? It's not the "white and colored" clothes thing my wife keeps talking about, is it?

Look, I understand collation. I also understand that collation is different for different languages and for different needs. In some cases, greatly different.

The point of this discussion was how php collates/sorts or otherwise orders characters/strings when given the operation to increment from "a" to "z".

As this thread has demonstrated, there's a wide range of expectations as to how that should happen.

The reference you provided touches upon some of the problems that collation faces when trying to develop collation systems for different needs. This discussion was no different.

>It just so happens that in English iso-8859-1 there is a 1:1 relationship between the numeric values and the collation order, but you can think of that as dumb luck.

No, English iso-8859-1 was designed to conform to the ASCII standard -- the same as Unicode and other standards that followed. Making "standards" backward compatible isn't dumb luck; it's good design.

Considering that you're dealing with English iso-8859-1 and ASCII (the "American" Standard Code for Information Interchange), I think the connection between numeric values and collation went hand-in-hand by design. It was not an accident.

It's just too bad that the "powers-that-be" at the time didn't realize that 7-bit wouldn't cover everything to come in the near future.
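To make the point concrete, here is a small sketch in Python (not php, and only an illustration) of how far the numeric values carry you within ASCII. It relies on Python's default string comparison, which compares code points; that is an assumption of the sketch, not a statement about how php does it. Within the lowercase letters the numeric order and the alphabetical order agree, but they part ways as soon as upper and lower case are mixed.

```python
# Python compares strings by code point by default; this sketch relies on that.
letters = [chr(c) for c in range(ord("a"), ord("z") + 1)]

# Within the lowercase range, numeric order and alphabetical order agree.
lowercase_agrees = sorted(letters) == letters

# Across cases they do not: every ASCII uppercase letter precedes "a".
mixed = sorted(["banana", "Zebra", "apple"])

print(lowercase_agrees)     # True
print(mixed)                # ['Zebra', 'apple', 'banana']
print(ord("Z"), ord("a"))   # 90 97
```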

>To better understand this, I suggest you start reading here:
>
>  http://icu.sourceforge.net/userguide/Collate_Intro.html
>
>Note one of the points on that page.  That in Lithuanian 'y' falls between 'i' and 'k'.  So even without going into Unicode and just using low-ascii, you have these issues.

I don't have these issues because I'm not Lithuanian. If a Lithuanian php programmer wants "y" to fall between "i" and "k" in a loop, then good luck -- for I can't get it to stop when it passes "z" -- which I think it should.
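For anyone who hasn't followed the whole thread, here is a rough Python emulation of the loop in question. It assumes the behaviour described earlier in this thread: incrementing "z" carries over to "aa", and the <= test compares the strings character by character. The function names are mine and the sketch handles lowercase letters only, so treat it as an approximation rather than php's actual implementation.

```python
def increment(s: str) -> str:
    """Increment a lowercase string with carry: 'a'->'b', 'z'->'aa', 'az'->'ba'."""
    chars = list(s)
    i = len(chars) - 1
    while i >= 0:
        if chars[i] != "z":
            chars[i] = chr(ord(chars[i]) + 1)
            return "".join(chars)
        chars[i] = "a"  # wrap this position and carry to the left
        i -= 1
    return "a" + "".join(chars)  # carried past the first character


def loop_values(start: str = "a", stop: str = "z", limit: int = 10_000) -> list:
    """Collect every value visited by: for (i = start; i <= stop; i++)."""
    values = []
    current = start
    while current <= stop and len(values) < limit:
        values.append(current)
        current = increment(current)
    return values


values = loop_values()
print(len(values))     # 676
print(values[25:28])   # ['z', 'aa', 'ab']
print(values[-1])      # yz
```

The loop sails past "z" because "aa" compares as less than "z", and it only ends once the value reaches "za".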

But, as far as I am aware, there is no low-ASCII, there is no high-ASCII, there is simply ASCII. While it is common to use the term extended-ASCII, it's a misnomer because the American Standards Association had nothing to do with establishing/defining any character above DEC 127.

The above example you referenced is simply one of many and demonstrates that the "collation" problem is very complex. You should look into how Unicode performs canonical ordering of combining characters, such as an accent, umlaut, or cedilla, as well as how that combination affects collation in different languages, as stated in your reference. You will see that the canonical ordering algorithm is numeric.
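A short Python sketch of what I mean, using the standard unicodedata module (Python is used only for illustration). Each combining mark carries a numeric canonical combining class, and normalization reorders adjacent marks by that number. The variable names are mine.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
CEDILLA = "\u0327"    # COMBINING CEDILLA

# Each combining mark has a numeric canonical combining class.
classes = {name: unicodedata.combining(mark)
           for name, mark in (("acute", ACUTE),
                              ("dot_below", DOT_BELOW),
                              ("cedilla", CEDILLA))}
print(classes)  # {'acute': 230, 'dot_below': 220, 'cedilla': 202}

# Typed as: a, acute, dot below. NFD puts the lower class (220) first.
typed = "a" + ACUTE + DOT_BELOW
normalized = unicodedata.normalize("NFD", typed)
print([hex(ord(c)) for c in normalized])  # ['0x61', '0x323', '0x301']
```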

Yes, I'm very aware of Unicode. I'm aware enough to know that they have assigned numerical equivalents to every glyph known to man including those combining glyphs such as those mentioned above to produce combination characters. When I say every "glyph known to man", that includes much more than language specific glyphs.

I'm also aware of IDNS and how they implement Unicode, which is not inclusive. Take case mapping, for example: IDNS simply translates everything it perceives to be uppercase to lowercase. Some characters are combination characters when lowercase and a single character when uppercase, thus there is no single lowercase representation for the uppercase character. Oops, I just lost the 16th century (Roller Ball).
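One case of this kind can be seen with Python's built-in case mapping. To be clear, this shows Python's str.lower() and str.upper(), not the IDNS mapping tables, so it is only an analogy for the sort of loss I'm describing. The capital I with a dot above lowercases to two code points, and uppercasing that result does not give back the original single character.

```python
capital_i_dot = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Lowercasing yields two code points: 'i' plus COMBINING DOT ABOVE.
lowered = capital_i_dot.lower()
print([hex(ord(c)) for c in lowered])   # ['0x69', '0x307']

# Uppercasing again gives 'I' plus the combining dot, not the original character.
round_trip = lowered.upper()
print(round_trip == capital_i_dot)      # False

# The reverse situation: one lowercase letter, two uppercase letters.
sharp_s_upper = "\u00df".upper()
print(sharp_s_upper)                    # SS
```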

Now, I can appreciate the problems facing php considering that it has to deal not only with Unicode, but also with the IDNS when addressing Unicode and the Internet. But that problem is not going to be solved by ignoring that Unicode code-points have numeric (and other) values. I would think that serious collation systems use numeric values in some fashion in their algorithms -- don't they? If not, please explain how they detect differences between characters and group them into collation tables.
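Here is the kind of thing I have in mind, as a toy Python sketch. The alphabet string, function name, and word list are all made up for illustration; this is not ICU's data or algorithm. It simply places 'y' right after 'i', in the spirit of the Lithuanian example. Each character is mapped to a numeric weight from a table and the strings are compared by those weights instead of by code point.

```python
# Hypothetical toy alphabet: 'y' is placed right after 'i' (so between 'i' and 'k').
TOY_ALPHABET = "abcdefghiyjklmnopqrstuvwxz"


def make_sort_key(alphabet: str):
    """Build a key function that maps each character to its rank in `alphabet`."""
    weights = {ch: rank for rank, ch in enumerate(alphabet)}

    def sort_key(word: str) -> list:
        return [weights[ch] for ch in word]

    return sort_key


words = ["kava", "yla", "jis", "ir"]

by_code_point = sorted(words)
by_table = sorted(words, key=make_sort_key(TOY_ALPHABET))

print(by_code_point)  # ['ir', 'jis', 'kava', 'yla']
print(by_table)       # ['ir', 'yla', 'jis', 'kava']
```

The order changes, but the comparison is still a comparison of numbers; only the table that supplies them differs.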

>Now, until we get to PHP 6, we don't have decent Unicode support and we don't have LOCALE-aware operators.  You will have to manually use strcoll() to get them, but that is going to change and you will have the ICU collation algorithms available and for Unicode strings it will be automatic.  You can still have binary-strings if you don't want locale-aware collation, of course.

Well, good luck with that.

tedd
-- 
------------------------------------------------------------------------------------
http://sperling.com  http://ancientstones.com  http://earthstones.com
