=?utf-8?Q?Adam_Sj=C3=B8gren?= <asjo@xxxxxxxxxxxx> writes: > Tom writes: >> ... which implies that the problem is unexpectedly high contention for the >> ProcArrayLock. > One thing I should have mentioned, but forgot, is that the database is > configured to do logical replication to another machine - could that > have an effect on the lock contention? > A colleague pointed out that in the pg_locks output, the replication > processes are waiting on the ProcArray lock: > · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz Yeah, that is *mighty* interesting. For the archives' sake, I'll point out that this shows 16 walsender processes, of which 6 are in WalSenderWaitForWAL waits (ie, basically idle) and all of the other 10 are waiting for the ProcArrayLock. There are also three backends in ProcArrayGroupUpdate state, confirming that we have a lot of contention for the ProcArrayLock ... but interestingly, no non-walsender processes appear to be blocked directly on the ProcArrayLock. In particular, I *don't* see a backend that could plausibly be the leader of that group commit; the other sessions showing non-SELECT activity are waiting for assorted heavyweight locks, and they all look like they're still running their transactions, not trying to commit. Perhaps this just reflects the fact that pg_stat_activity isn't an instantaneous snapshot, though? It's also evident that there's some parallel query activity going on, as some sessions are in BgWorkerShutdown waits; so parallel worker entry/exit might be contributing to the overall load on ProcArrayLock. Anyway, we still don't know more than that the problem is ProcArrayLock contention; it seems that the walsenders are involved, but we don't really know if they're causing the problem or are just more victims. One thing I'm curious about is whether you have walsenders starting or stopping on a regular basis? The ones visible here seem to all have been running for at least an hour or so, but in a quick look through the code it seemed like much of the walsender code that would have any reason to take the ProcArrayLock is in walsender startup or shutdown. Another angle that is worth a try is to see whether you can reduce the number of walsenders, and then see if that ameliorates the issue. I'm not totally sure but I think we support cascaded replication, so that you could reduce the replication load on the primary just by reconfiguring some of the standbys to pull from other ones not directly from the primary. regards, tom lane