Re: Slow performance of collate "en_US.utf8"

Thomas Munro <thomas.munro@xxxxxxxxx> · Sat, 1 Mar 2025 11:49:38 +1300

On Sat, Mar 1, 2025 at 9:03 AM Joe Conway <mail@xxxxxxxxxxxxx> wrote:
> On 2/28/25 09:16, Laurenz Albe wrote:
> > On Thu, 2025-02-27 at 16:54 +0300, Alexey Borschev wrote:
> >> I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4.
> >
> > I'd say that you would have to complain to the authors of the
> > GNU C library, which provides this collation.
>
> Yep -- glibc starting with version 2.21 has a massive performance
> regression for certain cases and the glibc folks have basically said
> they will not fix it. If you try the same thing on RHEL 7.x with glibc
> 2.17 it will perform about the same as ICU.

I've idly wondered if this is the culprit, do you know?

https://github.com/bminor/glibc/commit/0742aef6e52a935f9ccd69594831b56d807feef3

It seems to have bet that strings either differ in primary weight as
early as they do in synthetic natural language tests, or not at all
because they are equal and that is detected with a fast-path binary
comparison.  The average first different character word-to-word in my
/usr/shared/dict/words is at position ~5.6 (some kind of worst case as
it is already sorted), cf 14+ multibyte sequences in OP's example,
which must be well outside their test parameters I would guess.  I
didn't read the code but the description has a miasma of
quadratic-catching-on-fire about it: it's now rescanning the secondary
weights with repeated traversals, because the cache they ripped out
wasn't pulling its own weight at small common prefix sizes, or
something like that?  I wonder if 2.21 also got faster for PostgreSQL
sorting /usr/share/dict/words as you might expect from that
description.  Database keys with long common prefixes *probably*
shouldn't be using natural language sorting anyway, so "don't do
that", but knowledge of collations is not well distributed... on the
other hand I suspect you can dream up some real natural language
examples that lose the bet too: sort "lastname, firstname" across a
country of 300 million, maybe?