Re: many backends hang on MultiXactOffsetSLRU

Alvaro Herrera <alvherre@xxxxxxxxxxxxxx> · Tue, 10 Sep 2024 10:13:51 +0200

On 2024-Sep-10, James Pang wrote:

> Hi experts,
>     we have a Postgresql v14.8 database, almost thousands of backends hang
> on MultiXactOffsetSLRU at the same time, all of these sessions running same
> query "SELECT ....", from OS and postgresql slow log, we found all of these
> query on "BIND" stage.
>  LOG:  duration: 36631.688 ms  bind S_813: SELECT
> LOG:  duration: 36859.786 ms  bind S_1111: SELECT
> LOG:  duration: 35868.148 ms  bind <unnamed>: SELECT
> LOG:  duration: 36906.471 ms  bind <unnamed>: SELECT
> LOG:  duration: 35955.489 ms  bind <unnamed>: SELECT
> LOG:  duration: 36833.510 ms  bind <unnamed>: SELECT
> LOG:  duration: 36839.535 ms  bind S_1219: SELECT
> ...
> 
> this database hang on MultiXactOffsetSLRU and MultiXactOffsetBuffer long
> time.
> 
> could you direct me why they are hanging on 'BIND‘ stage with
> MultiXactOffsetSLRU ?

Very likely, it's related to this problem
[1] https://thebuild.com/blog/2023/01/18/a-foreign-key-pathology-to-avoid/

This is caused by a suboptimal implementation of what we call SLRU,
which multixact uses underneath.  For years, many people dodged this
problem by recompiling with a changed value for
NUM_MULTIXACTOFFSET_BUFFERS in src/include/access/multixact.h (it was
originally 8 buffers, which is very small); you'll need to do that in
all releases up to pg16.  In pg17 this was improved[2] and you'll be
able to change the value in postgresql.conf, though the default already
being larger than the original (16 instead of 8), you may not need to.

[2] https://pgconf.in/files/presentations/2023/Dilip_Kumar_RareExtremelyChallengingPostgresPerformanceProblems.pdf
[3] https://www.pgevents.ca/events/pgconfdev2024/schedule/session/53-problem-in-postgresql-slru-and-how-we-are-optimizing-it/

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La victoria es para quien se atreve a estar solo"