Slow performance of collate "en_US.utf8"

Alexey Borschev <a.borschev@xxxxxxxxxxxxxx> · Mon, 3 Mar 2025 11:47:07 +0300

    Hi!
    Thank everyone for Your answers!
    It is now clear, that it is not PG issue and it will not be fixed
      anytime soon.
    I see that with pure numbers sorting en_US.utf8 is still well
      behind:

      explain (analyze, costs, buffers, verbose) 
select gen.id::text collate "C"   
from  generate_series(10000, 20000) AS gen(id)
order by 1 desc; 
-- 3.5 ms

explain (analyze, costs, buffers, verbose) 
select gen.id::text collate "en_US.utf8"
from  generate_series(10000, 20000) AS gen(id)
order by 1 desc; 
-- 19.8 ms

On the other hand, when I add limit 1, the difference become much less for the reasons I do not understand:

explain (analyze, costs, buffers, verbose) 
select gen.id::text collate "C"   
from  generate_series(10000, 20000) AS gen(id)
order by 1 desc
limit 1; 
-- 1.82 ms

explain (analyze, costs, buffers, verbose) 
select gen.id::text collate "en_US.utf8"
from  generate_series(10000, 20000) AS gen(id)
order by 1 desc
limit 1; 
-- 2.8 ms

    In fact, I've got no database issues right now - just
      benchmarking search speed of b-tree indexes on different columns
      types - int4, int8, numeric, texts, uuids, and run into this
      corner case.

    l hope to make a talk about this on one of the PG conferences
      some day.