Re: Losing my latin on Ordering...

Laurenz Albe <laurenz.albe@xxxxxxxxxxx> · Tue, 14 Feb 2023 11:23:19 +0100

On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:
> Hi. Porting a unit test to PostgreSQL, we got a failure related to ordering.
> 
> We've distilled it to the below. The DB is en_US.UTF-8, and the sorting we get
> does not make sense to me. The same prefix can be sorted differently based on
> the suffix apprently, which doesn't make any sense to me.
> 
> Surely sorting should be "constant left-to-right", no? What are we missing?

No, it isn't.  That's not how natural language collations work.

They typically use different levels of comparison: first, strings are sorted
according to base character, ignoring accents, case and punctuation.
Wherever that comparison is equal, the next level is used (typically accents),
then the next (case), and so on.

> I'm already surprised (star) comes before (space), when the latter "comes
> before" the former in both ASCII and UTF-8, but that the two "Foo*" and "Foo "
> prefixed pairs are not clustered after sorting is just mistifying to me. So how come?

Because they compare identical on the first three levels.  Any difference in
letters, accents or case weighs stronger, even if it occurs to the right
of these substrings.

> For now we can work-around this by explicitly adding the `collate "C"` on
> the queries underlying that particular test, but that would be wrong in the
> general case of international strings to sort, so I'd really like to understand
> what's going on.

Yes, it soulds like the "C" collation may be best for you.  That is, if you don't
mind that "Z" < "a".

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com