On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz.albe@xxxxxxxxxxx> wrote:
On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:
> Hi. Porting a unit test to PostgreSQL, we got a failure related to ordering.
>
> We've distilled it to the below. The DB is en_US.UTF-8, and the sorting we get
> does not make sense to me. The same prefix can be sorted differently based on
> the suffix apprently, which doesn't make any sense to me.
>
> Surely sorting should be "constant left-to-right", no? What are we missing?
No, it isn't. That's not how natural language collations work.
Honestly, who expects the same prefix to sort differently based on what comes after, in left-to-right languages?
How does one even find out what the (capricious?) rules for sorting in a given collation are?
I'm aware of sorting taking numerical numbers in text influencing sort, so "Foo10" comes after "Foo9",
but that's not what we are discussing here. "Foo*" and "Foo " have no logical relatioship, like 9 and 10 do.
> I'm already surprised (star) comes before (space), when the latter "comes
> before" the former in both ASCII and UTF-8, but that the two "Foo*" and "Foo "
> prefixed pairs are not clustered after sorting is just mistifying to me. So how come?
Because they compare identical on the first three levels. Any difference in
letters, accents or case weighs stronger, even if it occurs to the right
of these substrings.
That's completely unintuitive...
> For now we can work-around this by explicitly adding the `collate "C"` on
> the queries underlying that particular test, but that would be wrong in the
> general case of international strings to sort, so I'd really like to understand
> what's going on.
Yes, it soulds like the "C" collation may be best for you. That is, if you don't
mind that "Z" < "a".
I would mind if I asked for case-insensitive comparisons.
So the "C" collation is fine with general UTF-8 encoding?
I.e. it will be codepoint ordered OK?