Re: Wrong sorting on docker image

Thomas Munro <thomas.munro@xxxxxxxxx> · Sun, 17 Oct 2021 15:30:08 +1300

On Sun, Oct 17, 2021 at 4:42 AM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Speaking of ICU, if you are using an ICU-enabled Postgres build,
> maybe you could find an ICU collation that acts the way you want.
> This wouldn't be a perfect solution, because we don't yet have
> the ability to set an ICU collation as a database's default.
> But you can attach ICU collations to individual text columns,
> and maybe that would be a good enough workaround.

For what it's worth, ICU's "ru-RU-x-icu" and FreeBSD's libc agree with
glibc on these sort orders, so I suspect this comes from common,
synchronised CLDR/UCA/DUCET/ISO 14651 data.  I don't know Russian and
I'm only speculating wildly here, but it does look suspicious to me,
as if ё is perhaps getting a lower weight than it should.  That said,
it seems strange that something so basic should be wrong.  Nosing
around in the unicode.org issue tracker, it seems that some people
think there is something funny about Ё (and I wonder if there are/were
similar issues with й/Й):

https://unicode-org.atlassian.net/browse/CLDR-2745?jql=text%20~%20%22%D0%81%22
https://unicode-org.atlassian.net/browse/CLDR-1974?jql=text%20~%20%22%D0%81%22
(and more)
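
To make the "lower weight" guess concrete, here is a toy Python sketch
(not how ICU or glibc actually compute keys; real collators use weight
tables, and the word list and function names are made up for
illustration).  It compares two assumed models: ё as a separate letter
after е, versus ё sharing е's primary weight with the dots acting only
as a tie-breaker.

```python
# Toy model only: positions in this string stand in for real collation weights.
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

def separate_letter_key(word):
    """Assumed model A: ё is its own letter, placed right after е."""
    return [ALPHABET.index(ch) for ch in word.lower()]

def accent_variant_key(word):
    """Assumed model B: ё has the same primary weight as е;
    the difference only matters when the primary weights tie."""
    w = word.lower()
    primary = [ALPHABET.index("е" if ch == "ё" else ch) for ch in w]
    secondary = [1 if ch == "ё" else 0 for ch in w]
    return (primary, secondary)

# Hypothetical sample words, not taken from the original report.
words = ["ель", "ёлка", "ежи", "ёж"]

as_separate_letter = sorted(words, key=separate_letter_key)
as_accent_variant = sorted(words, key=accent_variant_key)

print(as_separate_letter)  # ['ежи', 'ель', 'ёж', 'ёлка']
print(as_accent_variant)   # ['ёж', 'ежи', 'ёлка', 'ель']
```

If the shared data behaves like model B, words starting with ё end up
interleaved with words starting with е, which may be the kind of
ordering being reported here.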

It's probably not a great idea, but for the record, you can build your
own collation for glibc and other POSIX-like systems.  For example, see
glibc commit 159738548130d5ac4fe6178977e940ed5f8cfdc4, where they
previously had customisations on top of the iso14651_t1 file to
reorder a special Ukrainian character in ru_RU, so in theory you could
reorder ё/Ё with a similar local hack and call it ru_RU_X...  I also
wonder if there is some magic switch you can put after an @ symbol on
ICU collations that would change this, perhaps some way to disable the
"contractions" that are potentially implicated here.  Not sure.