Re: Losing my latin on Ordering...

Laurenz Albe <laurenz.albe@xxxxxxxxxxx> · Tue, 14 Feb 2023 12:40:46 +0100

On Tue, 2023-02-14 at 12:17 +0100, Dominique Devienne wrote:
> On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz.albe@xxxxxxxxxxx> wrote:
> > On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:
> > > Surely sorting should be "constant left-to-right", no? What are we missing?
> > 
> > No, it isn't.  That's not how natural language collations work.
> 
> Honestly, who expects the same prefix to sort differently based on what comes
> after, in left-to-right languages?
> How does one even find out what the (capricious?) rules for sorting in a given
> collation are?

Look at the documentation / implementation.

As far as ICU is concerned, the Unicode Collation Algorithm is specified here:
https://unicode.org/reports/tr10/

> > > I'm already surprised (star) comes before (space), when the latter "comes
> > > before" the former in both ASCII and UTF-8, but that the two "Foo*" and "Foo "
> > > prefixed pairs are not clustered after sorting is just mystifying to me. So how come?
> > 
> > Because they compare identical on the first three levels.  Any difference in
> > letters, accents or case weighs stronger, even if it occurs to the right
> > of these substrings.
> 
> That's completely unintuitive...

Well, you can complain to GNU and the Unicode Consortium, but that's pretty
much the way it is.
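
To make the "levels" point concrete, here is a toy sketch in Python.  It is
NOT the real Unicode Collation Algorithm, nor what glibc or ICU actually do;
the function toy_sort_key and the sample strings are made up for illustration.
It only assumes that punctuation and spaces are ignored on the first levels
and looked at last, as a tie-breaker.

```python
def toy_sort_key(s):
    """Toy multi-level key (an assumption for illustration, not real UCA)."""
    letters = "".join(ch for ch in s if ch.isalnum())
    # Level 1: letters and digits only, case ignored.
    # Level 2: the same characters with case taken into account.
    # Last level: the full string, so punctuation only breaks ties.
    return (letters.casefold(), letters, s)


# Hypothetical sample data: two "Foo*" strings and two "Foo " strings.
names = ["Foo*b", "Foo a", "Foo*a", "Foo b"]

toy_sorted = sorted(names, key=toy_sort_key)
codepoint_sorted = sorted(names)  # Python compares str by code point

print(toy_sorted)        # ['Foo a', 'Foo*a', 'Foo b', 'Foo*b']
print(codepoint_sorted)  # ['Foo a', 'Foo b', 'Foo*a', 'Foo*b']
```

In the toy ordering the difference between "a" and "b" at the end outweighs
the difference between "*" and " " in the middle, so the prefixes are not
clustered.  In code point order they are.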

> > Yes, it sounds like the "C" collation may be best for you.  That is, if you don't
> > mind that "Z" < "a".
> 
> I would mind if I asked for case-insensitive comparisons.
> 
> So the "C" collation is fine with general UTF-8 encoding?
> I.e. it will be codepoint ordered OK?

Yes, exactly.
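
As a small sanity check outside the database, here is a Python sketch (the
sample strings are made up).  It relies on the fact that sorting UTF-8
encoded strings byte by byte gives the same order as sorting by code point,
which is why a bytewise comparison behaves sensibly with UTF-8.

```python
# Hypothetical sample data, including a non-ASCII character.
words = ["a", "Z", "é", "Foo*b", "Foo a", "Foo*a", "Foo b"]

# Python compares str objects code point by code point.
by_codepoint = sorted(words)

# Bytewise comparison of the UTF-8 encoding, similar in spirit to memcmp().
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_codepoint)
# ['Foo a', 'Foo b', 'Foo*a', 'Foo*b', 'Z', 'a', 'é']
print(by_codepoint == by_utf8_bytes)  # True
```

Note that "Z" sorts before "a", and that (space) sorts before (star), so the
"Foo " and "Foo*" strings stay together.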

Yours,
Laurenz Albe