On Sat, Mar 1, 2025 at 9:03 AM Joe Conway <mail@xxxxxxxxxxxxx> wrote: > On 2/28/25 09:16, Laurenz Albe wrote: > > On Thu, 2025-02-27 at 16:54 +0300, Alexey Borschev wrote: > >> I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4. > > > > I'd say that you would have to complain to the authors of the > > GNU C library, which provides this collation. > > Yep -- glibc starting with version 2.21 has a massive performance > regression for certain cases and the glibc folks have basically said > they will not fix it. If you try the same thing on RHEL 7.x with glibc > 2.17 it will perform about the same as ICU. I've idly wondered if this is the culprit, do you know? https://github.com/bminor/glibc/commit/0742aef6e52a935f9ccd69594831b56d807feef3 It seems to have bet that strings either differ in primary weight as early as they do in synthetic natural language tests, or not at all because they are equal and that is detected with a fast-path binary comparison. The average first different character word-to-word in my /usr/shared/dict/words is at position ~5.6 (some kind of worst case as it is already sorted), cf 14+ multibyte sequences in OP's example, which must be well outside their test parameters I would guess. I didn't read the code but the description has a miasma of quadratic-catching-on-fire about it: it's now rescanning the secondary weights with repeated traversals, because the cache they ripped out wasn't pulling its own weight at small common prefix sizes, or something like that? I wonder if 2.21 also got faster for PostgreSQL sorting /usr/share/dict/words as you might expect from that description. Database keys with long common prefixes *probably* shouldn't be using natural language sorting anyway, so "don't do that", but knowledge of collations is not well distributed... on the other hand I suspect you can dream up some real natural language examples that lose the bet too: sort "lastname, firstname" across a country of 300 million, maybe?