Tom writes:

> Frequently, stack traces taken without debug symbols are pretty useless,
> but you seem to have got lucky.

For once, hah!

> It seems clear that the unexpected wait is here:
>
>> (gdb) bt
>> #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>> #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
>> #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
>> #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
>> #4  0x000055f60d36c252 in PGSemaphoreLock ()
>> #5  0x000055f60d3e35cc in LWLockAcquire ()
>> #6  0x000055f60d3d14ee in ProcArrayAdd ()
>> #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
>> #8  0x000055f60d51c99b in InitPostgres ()
>> #9  0x000055f60d3f2baf in PostgresMain ()
>> #10 0x000055f60d102bf8 in ?? ()
>> #11 0x000055f60d37e492 in PostmasterMain ()
>> #12 0x000055f60d103fa5 in main ()
>
> which implies that the problem is unexpectedly high contention for the
> ProcArrayLock.  It's unsurprising that incoming processes would need
> that lock, since they need to add themselves to the ProcArray.  What's
> not so apparent is why there is so much activity as to block them
> from doing so for a significant amount of time.  We can guess that
> the "worker processes" you mentioned are ganging up on that lock
> because they need to compute query snapshots, but that still doesn't
> seem like enough of an explanation.
>
> Can you show us exactly what the "worker processes" are doing while
> this is going on?

One thing I should have mentioned, but forgot, is that the database is
configured to do logical replication to another machine - could that
have an effect on the lock contention?

A colleague pointed out that in the pg_locks output, the replication
processes are waiting on the ProcArray lock:

 · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz

I can try turning off replication the next time I see the problem.

The workers basically loop on the following (a rough SQL sketch of one
iteration is in the P.S. below):

 * SELECT a job from a job-table, UPDATE it
 * Take an advisory lock on the relevant row in the calculations table
 * Do some work, at least a couple of SELECTs from various tables
 * UPDATE or INSERT a row into the calculations table, release the lock
 * INSERT a row in an exitstatus table
 * DELETE the job

The calculations can take anywhere from sub-second to several minutes,
with the majority at the shorter end. While this is happening there are
millions of jobs in the job-table.

When the problem occurs, the worker logfiles indicate that they are
moving along, still working.

To give a sense of scale, here is a plot of the number of transactions
per second:

 · https://koldfront.dk/misc/postgresql/postgres_transactions_ALL-pinpoint_1597799383-1597907383.png

(I think the hole in the graph is when Munin timed out getting the
information, and the number of transactions dropped as I stopped most
of the workers.)


  Best regards,

    Adam

-- 
 "Examination and mastering of a new highly                  Adam Sjøgren
  intellectual equipment was a hard labour."            asjo@xxxxxxxxxxxx
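P.S.: To make the worker loop a bit more concrete, here is a rough
sketch of what one iteration does, written as illustrative SQL. The
table and column names (jobs, calculations, exitstatus, id,
calculation_id, claimed, result, status) and the :placeholders are
simplified stand-ins for the real schema and bind parameters, and the
ON CONFLICT form is just one way to express the "UPDATE or INSERT"
step:

  -- claim a job from the job-table and mark it as taken (simplified)
  BEGIN;
  SELECT id, calculation_id FROM jobs WHERE NOT claimed LIMIT 1;
  UPDATE jobs SET claimed = true WHERE id = :job_id;
  COMMIT;

  -- serialize work on the relevant calculations row via an advisory lock
  SELECT pg_advisory_lock(:calculation_id);

  -- the actual work: at least a couple of SELECTs from various tables,
  -- then write the result into the calculations table
  INSERT INTO calculations (id, result)
       VALUES (:calculation_id, :result)
  ON CONFLICT (id) DO UPDATE SET result = EXCLUDED.result;

  SELECT pg_advisory_unlock(:calculation_id);

  -- record the outcome and remove the finished job
  INSERT INTO exitstatus (job_id, status) VALUES (:job_id, 'ok');
  DELETE FROM jobs WHERE id = :job_id;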
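P.P.S.: The next time it happens I can also capture which backends are
stuck on lightweight locks such as the ProcArray lock with a query
along these lines - this assumes PostgreSQL 10 or newer (for the
wait_event and backend_type columns) and is not the exact query behind
the file linked above:

  SELECT pid, backend_type, state, wait_event_type, wait_event,
         left(query, 60) AS query
    FROM pg_stat_activity
   WHERE wait_event_type = 'LWLock';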